Efficient topic-based unsupervised name disambiguation

Authors:
Yang Song;Jian Huang;Isaac G. Councill;Jia Li;C. Lee Giles
Affiliations:
The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA
Venue:
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Year:
2007

Abstract

Name ambiguity is a special case of identity uncertainty in which one person can be referenced by multiple name variations in different situations, or can share the same name with other people. In this paper, we focus on disambiguating person names within web pages and scientific documents. We present an efficient and effective two-stage approach. In the first stage, we propose two novel topic-based models that extend two hierarchical Bayesian text models, namely Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). Our models explicitly introduce a new variable for persons and learn the distribution of topics with regard to persons and words. In the second stage, once an initial model has been learned, the topic distributions are treated as feature sets and names are disambiguated with a hierarchical agglomerative clustering method. Experiments on web data and scientific documents from CiteSeer indicate that our approach consistently outperforms other unsupervised learning methods, such as spectral clustering and DBSCAN clustering, and could be extended to other research fields. We empirically address the issue of scalability by disambiguating authors in over 750,000 papers from the entire CiteSeer dataset.
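
The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the distance measure, linkage, or stopping rule, so the Hellinger distance, average linkage, and the `threshold` parameter below are assumptions, and the topic distributions in `topic_mix` are invented values standing in for the output of a learned topic model.

```python
import math
from itertools import combinations


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (assumed metric)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def average_linkage(cluster_a, cluster_b, features):
    """Mean pairwise distance between the members of two clusters (assumed linkage)."""
    total = sum(hellinger(features[i], features[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerative_cluster(features, threshold):
    """Repeatedly merge the closest pair of clusters until none is within threshold."""
    clusters = [[i] for i in range(len(features))]  # start with one cluster per mention
    while len(clusters) > 1:
        (a, b), dist = min(
            (((a, b), average_linkage(clusters[a], clusters[b], features))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > threshold:  # hypothetical stopping rule
            break
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b keeps a valid
        del clusters[b]
    return [sorted(c) for c in clusters]


# Invented topic distributions for four mentions of the same name string,
# over three topics. Mentions 0 and 1 lean on topic 0; mentions 2 and 3 on topic 2.
topic_mix = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.05, 0.90],
    [0.10, 0.05, 0.85],
]

clusters = agglomerative_cluster(topic_mix, threshold=0.3)
print(clusters)
```

Each resulting cluster is read as one distinct person. The naive pairwise search shown here is cubic or worse in the number of mentions and is only meant to make the idea concrete; it says nothing about how the paper achieves scalability.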