Enhancing cross document coreference of web documents with context similarity and very large scale text categorization

Authors:
Jian Huang;Pucktada Treeratpituk;Sarah M. Taylor;C. Lee Giles
Affiliations:
Pennsylvania State University;Pennsylvania State University;Lockheed Martin IS&GS;Pennsylvania State University
Venue:
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Year:
2010

Abstract

Cross Document Coreference (CDC) is the task of constructing the coreference chain for mentions of a person across a set of documents. This work offers a holistic view of using document-level categories, sub-document-level context, and extracted entities and relations for the CDC task. We train a categorization component with an efficient flat algorithm using thousands of ODP categories and over a million web documents. We propose using ranked categories as coreference information, which is particularly suitable for web documents that differ widely in style and content. An ensemble composite coreference function, amenable to inactive features, combines these three levels of evidence for disambiguation. A thorough feature importance study analyzes how these three components contribute to the coreference results. The overall solution is evaluated on the WePS benchmark data and demonstrates superior performance.
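
To make the idea of combining three levels of evidence concrete, the following is a minimal sketch, not the authors' method. It assumes a simple average over the features that are active for a document pair, whereas the abstract describes an ensemble function whose details are not given here. The names `cosine`, `jaccard`, `coreference_score`, and the dictionary keys `categories`, `context`, and `entities` are hypothetical, chosen only for illustration.

```python
from collections import Counter
from math import sqrt
from typing import Optional


def cosine(a: str, b: str) -> Optional[float]:
    """Bag-of-words cosine similarity; None marks the feature as inactive."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    if not va or not vb:
        return None  # no context text on one side
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm


def jaccard(a: set, b: set) -> Optional[float]:
    """Set overlap; None marks the feature as inactive when a set is empty."""
    if not a or not b:
        return None
    return len(a & b) / len(a | b)


def coreference_score(doc_a: dict, doc_b: dict) -> Optional[float]:
    """Average the active features (an assumption, not the paper's ensemble)."""
    features = [
        jaccard(set(doc_a["categories"]), set(doc_b["categories"])),  # document level
        cosine(doc_a["context"], doc_b["context"]),                   # sub-document level
        jaccard(set(doc_a["entities"]), set(doc_b["entities"])),      # extracted entities
    ]
    active = [f for f in features if f is not None]
    return sum(active) / len(active) if active else None
```

In this sketch, a pair of documents with no extracted entities on one side is still scored from the category and context features alone, which is one simple way to read "amenable to inactive features".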