Entity classification by bag of Wikipedia articles

  • Authors:
  • Tomáš Kliegr

  • Affiliations:
  • University of Economics, Prague, Prague, Czech Rep

  • Venue:
  • PIKM '10 Proceedings of the 3rd workshop on Ph.D. students in information and knowledge management
  • Year:
  • 2010

Abstract

The input to a Bag-of-Articles (BOA) classifier is a set of unlabeled entities (noun chunks) and a set of target labeled entities (Wikipedia articles). For each unlabeled entity, the classifier locates Wikipedia articles that might define it and performs disambiguation to select one. Both unlabeled and labeled entities are represented by the proposed BOA term weight vector, which is created by aggregating the term weight vectors of articles related to the Wikipedia article that defines the entity. The label is assigned by choosing the labeled entity, itself represented as a BOA term weight vector, that is closest under cosine similarity. The paper formally defines the BOA entity representation and BOA-based entity classification and presents a partial software implementation. A BOA-based disambiguation algorithm is presented as a planned extension.
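
The classification step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the raw term-frequency weighting, the aggregation by summation, the function names (`term_vector`, `boa_vector`, `cosine`, `classify`) and the toy article texts are all assumptions made for illustration, since the abstract does not specify the weighting scheme or how related articles are selected.

```python
import math
from collections import Counter


def term_vector(text):
    """Term weight vector of one article (assumed: raw term frequency)."""
    return Counter(text.lower().split())


def boa_vector(article_texts):
    """BOA vector: aggregate the term vectors of related articles (assumed: sum)."""
    total = Counter()
    for text in article_texts:
        total.update(term_vector(text))
    return total


def cosine(u, v):
    """Cosine similarity between two sparse term weight vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(unlabeled_boa, labeled_boas):
    """Return the label whose BOA vector is closest under cosine similarity."""
    return max(labeled_boas, key=lambda label: cosine(unlabeled_boa, labeled_boas[label]))


# Hypothetical toy data standing in for the texts of related Wikipedia articles.
labeled = {
    "animal": boa_vector(["cat dog mammal", "dog pet animal"]),
    "vehicle": boa_vector(["car engine wheel", "truck engine road"]),
}
entity = boa_vector(["dog pet leash"])
predicted = classify(entity, labeled)
```

In this toy setup the unlabeled entity shares terms only with the "animal" vector, so that label is chosen. A realistic system would likely replace the raw counts with a weighting such as TF-IDF, which is again an assumption rather than something the abstract states.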