Solving the "Who's Mark Johnson" puzzle: information extraction based cross document coreference

  • Authors: Jian Huang; Sarah M. Taylor; Jonathan L. Smith; Konstantinos A. Fotiadis; C. Lee Giles
  • Affiliations: Pennsylvania State University, University Park, PA; Lockheed Martin IS&GS, Arlington, VA; Lockheed Martin IS&GS, Arlington, VA; Lockheed Martin IS&GS, King of Prussia, PA; Pennsylvania State University, University Park, PA
  • Venue: SRWS '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium
  • Year: 2009

Abstract

Cross Document Coreference (CDC) is the problem of resolving the underlying identity of entities across multiple documents and is a major step for document understanding. We develop a framework to efficiently determine the identity of a person based on extracted information, which includes unary properties such as gender and title, as well as binary relationships with other named entities such as co-occurrence and geo-locations. At the heart of our approach is a suite of similarity functions (specialists) for matching relationships and a relational density-based clustering algorithm that delineates name clusters based on pairwise similarity. We demonstrate the effectiveness of our methods on the WePS benchmark datasets and point out future research directions.
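
The pipeline the abstract outlines (per-attribute similarity "specialists" feeding a density-based clustering over pairwise similarities) can be illustrated with a small sketch. This is not the authors' implementation: the `Mention` record, the two specialist functions, the unweighted averaging of specialist scores, the thresholds `min_sim` and `min_pts`, and the plain DBSCAN-style expansion are all assumptions chosen for illustration, and the sample data is invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mention:
    """Hypothetical record of what was extracted about one name mention."""
    doc_id: str
    gender: Optional[str] = None           # unary property
    title: Optional[str] = None            # unary property
    cooccurring: frozenset = frozenset()   # co-occurring named entities
    locations: frozenset = frozenset()     # associated geo-locations


def exact_match(a, b):
    """Specialist for unary properties; returns None when either value is unknown."""
    if a is None or b is None:
        return None
    return 1.0 if a.lower() == b.lower() else 0.0


def jaccard(a, b):
    """Specialist for set-valued relationships; returns None when either set is empty."""
    if not a or not b:
        return None
    return len(a & b) / len(a | b)


def similarity(m1, m2):
    """Average the specialists that have evidence (assumed, unweighted combination)."""
    scores = [
        exact_match(m1.gender, m2.gender),
        exact_match(m1.title, m2.title),
        jaccard(m1.cooccurring, m2.cooccurring),
        jaccard(m1.locations, m2.locations),
    ]
    known = [s for s in scores if s is not None]
    return sum(known) / len(known) if known else 0.0


def density_cluster(mentions, min_sim=0.5, min_pts=2):
    """DBSCAN-style clustering driven by pairwise similarity; -1 marks unclustered mentions."""
    n = len(mentions)
    # A mention always counts as its own neighbor.
    neighbors = [
        [j for j in range(n)
         if j == i or similarity(mentions[i], mentions[j]) >= min_sim]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # core point: keep expanding
                queue.extend(neighbors[j])
        cluster += 1
    return labels


# Invented mentions of one ambiguous name across four documents.
demo = [
    Mention("d1", "male", "professor", frozenset({"Org A"}), frozenset({"City X"})),
    Mention("d2", "male", "Professor", frozenset({"Org A", "Org B"}), frozenset({"City X"})),
    Mention("d3", "male", "coach", frozenset({"Org C"}), frozenset({"City Y"})),
    Mention("d4", "male", "coach", frozenset({"Org C", "Org D"}), frozenset({"City Y"})),
]
print(density_cluster(demo))  # [0, 0, 1, 1]
```

In this sketch, mentions labeled `-1` match no dense group; a CDC system would most plausibly treat each of them as a distinct person rather than discard them. Returning `None` from a specialist when evidence is missing keeps absent attributes from counting as disagreement, which is one possible design choice among several.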