What's in a name?: an unsupervised approach to link users across communities

Authors:
Jing Liu;Fan Zhang;Xinying Song;Young-In Song;Chin-Yew Lin;Hsiao-Wuen Hon
Affiliations:
Harbin Institute of Technology, Harbin, China;Nankai University, Tianjin, China;Microsoft Research, Redmond, WA, USA;NHN Corporation, Seoul, South Korea;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China
Venue:
Proceedings of the sixth ACM international conference on Web search and data mining
Year:
2013

Abstract

In this paper, we consider the problem of linking users across multiple online communities. Specifically, we focus on the alias-disambiguation step of this user linking task, which is meant to differentiate users with the same usernames. We start by quantitatively analyzing the importance of the alias-disambiguation step through a survey of 153 volunteers and an experimental analysis of a large About.me dataset (75,472 users). The analysis shows that the alias-disambiguation solution can address a major part of the user linking problem in terms of the coverage of true pairwise decisions (46.8%). To the best of our knowledge, this is the first study of human behavior with regard to the usage of online usernames. We then cast the alias-disambiguation step as a pairwise classification problem and propose a novel unsupervised approach. The key idea of our approach is to automatically label training instances based on two observations: (a) rare usernames are likely owned by a single natural person, e.g., pennystar88 as a positive instance; (b) common usernames are likely owned by different natural persons, e.g., tank as a negative instance. We propose using the n-gram probabilities of usernames to estimate their rareness or commonness. Moreover, we verify these two observations on a Yahoo! Answers dataset. The empirical evaluations on 53 forums verify (a) the effectiveness of the classifiers trained on the automatically generated training data and (b) that the rareness and commonness of usernames can help user linking. We also analyze the cases where the classifiers fail.
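
The automatic labeling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a character bigram model with add-one smoothing, trained on a tiny made-up corpus, and two hypothetical log-probability thresholds (rare_below and common_above). The abstract does not specify the n-gram order, the training corpus, or the thresholds, and the function names here are invented for illustration.

```python
import math
from collections import Counter

BOS, EOS = "^", "$"  # hypothetical start/end markers, not from the paper


def train_bigram_model(corpus):
    """Count character bigrams and their left contexts over a list of strings."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for word in corpus:
        chars = [BOS] + list(word.lower()) + [EOS]
        vocab.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab


def log_prob(username, model):
    """Add-one smoothed log-probability of a username under the bigram model."""
    bigrams, contexts, vocab = model
    v = len(vocab) + 1  # one extra slot reserved for unseen characters
    chars = [BOS] + list(username.lower()) + [EOS]
    total = 0.0
    for prev, cur in zip(chars, chars[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + v))
    return total


def auto_label(username, model, rare_below, common_above):
    """Label pairs of accounts that share this username.

    Returns 1 (assume same person) for rare usernames, 0 (assume different
    persons) for common usernames, and None when the score falls between
    the two assumed thresholds and the pair is left unlabeled.
    """
    lp = log_prob(username, model)
    if lp <= rare_below:
        return 1
    if lp >= common_above:
        return 0
    return None


# Toy corpus and thresholds, made up purely for illustration.
toy_corpus = ["tank", "tank", "tank", "tom", "tim", "star", "penny", "tanker"]
model = train_bigram_model(toy_corpus)
labels = {
    name: auto_label(name, model, rare_below=-15.0, common_above=-8.0)
    for name in ["pennystar88", "tank", "tanker"]
}
```

On this toy data, pennystar88 scores far below tank, so the first is labeled as a positive instance and the second as a negative one, while tanker falls between the thresholds and stays unlabeled. Note that an unnormalized log-probability also penalizes long usernames; whether and how to normalize for length is a design choice this sketch leaves open.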