Web-scale named entity recognition

  • Authors:
  • Casey Whitelaw; Alex Kehlenbeck; Nemanja Petrovic; Lyle Ungar

  • Affiliations:
  • Google, New York, NY, USA; Google, New York, NY, USA; Google, New York, NY, USA; University of Pennsylvania, Philadelphia, PA, USA

  • Venue:
  • Proceedings of the 17th ACM Conference on Information and Knowledge Management
  • Year:
  • 2008

Abstract

Automatic recognition of named entities such as people, places, organizations, books, and movies across the entire web presents challenges of both scale and scope. Training data for general named entity recognizers is difficult to come by, and once hundreds of millions of labeled observations are available, efficient machine learning methods are required to use them. We present an implemented system that addresses these issues, including a method for automatically generating training data and a multi-class online classification training method that learns to recognize not only high-level categories such as place and person, but also more fine-grained categories such as soccer players, birds, and universities. The resulting system achieves precision and recall comparable to those obtained for more limited entity types in much more structured domains, such as company recognition in newswire, even though web documents often lack consistent capitalization and grammatical sentence construction.
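
The abstract does not say which online learning algorithm the system uses. As a rough illustration of what "multi-class online classification" over fine-grained categories can look like, the sketch below uses a plain multi-class perceptron as a stand-in. It is not the authors' method. The feature strings, label names, and toy examples are all hypothetical and chosen only to mirror the categories mentioned in the abstract.

```python
from collections import defaultdict


class OnlineMulticlassPerceptron:
    """Illustrative online multi-class learner over sparse string features.

    Assumption: a simple perceptron stands in for the (unspecified)
    online training method described in the abstract.
    """

    def __init__(self, labels):
        self.labels = list(labels)
        # One sparse weight vector per label: weights[label][feature] -> float.
        self.weights = {label: defaultdict(float) for label in self.labels}

    def score(self, features, label):
        w = self.weights[label]
        # Only features already seen for this label contribute to the score.
        return sum(w[f] for f in features if f in w)

    def predict(self, features):
        # Ties resolve to the earliest label in self.labels.
        return max(self.labels, key=lambda label: self.score(features, label))

    def update(self, features, gold):
        """Consume one labeled observation; return True if it was already correct."""
        guess = self.predict(features)
        if guess == gold:
            return True
        # Mistake-driven update: promote the gold label, demote the wrong guess.
        for f in features:
            self.weights[gold][f] += 1.0
            self.weights[guess][f] -= 1.0
        return False


# Hypothetical context features for three fine-grained categories.
examples = [
    (["cap=True", "next=scored"], "soccer_player"),
    (["cap=True", "prev=attended"], "university"),
    (["cap=False", "next=nests"], "bird"),
]

model = OnlineMulticlassPerceptron(["soccer_player", "university", "bird"])

# Observations are processed one at a time, so memory use does not grow
# with the number of training examples, only with the number of features.
for _ in range(5):
    for features, gold in examples:
        model.update(features, gold)

predictions = [model.predict(features) for features, _ in examples]
```

Because each update touches only the features of a single observation, a learner of this kind can stream through very large labeled sets, which is the property the abstract emphasizes. The lowercase feature `cap=False` in the bird example also hints at why capitalization alone is an unreliable cue on web text.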