An Approach to Web-Scale Named-Entity Disambiguation

  • Authors:
  • Luís Sarmento;Alexander Kehlenbeck;Eugénio Oliveira;Lyle Ungar

  • Affiliations:
  • Faculdade de Engenharia da Universidade do Porto - DEI - LIACC, Porto, Portugal 4200-465;Google Inc, USA;Faculdade de Engenharia da Universidade do Porto - DEI - LIACC, Porto, Portugal 4200-465;University of Pennsylvania - CS, Philadelphia, USA

  • Venue:
  • MLDM '09 Proceedings of the 6th International Conference on Machine Learning and Data Mining in Pattern Recognition
  • Year:
  • 2009

Abstract

We present a multi-pass clustering approach to large-scale, wide-scope named-entity disambiguation (NED) on collections of web pages. Our approach uses name co-occurrence information to cluster, and hence disambiguate, entities, and is designed to handle NED on the entire web. We show that on web collections, NED becomes increasingly difficult as the corpus size increases, not only because of the challenge of scaling the NED algorithm, but also because new and surprising facets of entities become visible in the data. This effect limits the benefit that data-driven approaches can gain from processing larger data sets, and suggests that efficient clustering-based disambiguation methods for the web will require extracting more specialized information from documents.
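
To make the idea of clustering by name co-occurrence concrete, here is a minimal sketch in Python. It is not the authors' algorithm: it makes a single greedy pass rather than multiple passes, and the Jaccard similarity, the `threshold` value, the function names, and the toy mentions of the ambiguous name "Amazon" are all assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of co-occurring names."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_mentions(mentions, threshold=0.2):
    """Greedy single-pass clustering of mentions of one ambiguous name.

    `mentions` is a list of (doc_id, set_of_co_occurring_names) pairs.
    Each mention joins the most similar existing cluster if the overlap
    reaches `threshold` (a hypothetical value); otherwise it starts a
    new cluster.
    """
    clusters = []  # each cluster: {"docs": [...], "names": set()}
    for doc_id, names in mentions:
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(names, cluster["names"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["docs"].append(doc_id)
            best["names"] |= names  # the cluster profile grows as it absorbs mentions
        else:
            clusters.append({"docs": [doc_id], "names": set(names)})
    return clusters


# Toy data (invented): names that co-occur with "Amazon" in four documents.
mentions = [
    ("d1", {"Jeff Bezos", "Seattle", "AWS"}),
    ("d2", {"Brazil", "Manaus", "Peru"}),
    ("d3", {"Jeff Bezos", "Seattle", "Kindle"}),
    ("d4", {"Brazil", "Peru", "Colombia"}),
]
groups = [c["docs"] for c in cluster_mentions(mentions)]
print(groups)
```

In this toy setting the company mentions and the river mentions fall into separate clusters because their co-occurring names do not overlap. The abstract's observation is that such separation gets harder on larger corpora, where additional facets of an entity bring co-occurrence patterns that a small sample never shows.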