Clustering Wikipedia infoboxes to discover their types

  • Authors: Thanh Hoang Nguyen; Huong Dieu Nguyen; Viviane Moreira; Juliana Freire
  • Affiliations: University of Utah, Salt Lake City, UT, USA; University of Utah, Salt Lake City, UT, USA; UFRGS-Brazil, Porto Alegre, Brazil; New York University - Poly, New York, NY, USA
  • Venue: Proceedings of the 21st ACM International Conference on Information and Knowledge Management
  • Year: 2012

Abstract

Wikipedia has emerged as an important source of structured information on the Web. But while the success of Wikipedia can be attributed in part to the simplicity of adding and modifying content, this simplicity has also created challenges when it comes to using, querying, and integrating the information. Even though authors are encouraged to select appropriate categories and provide infoboxes that follow pre-defined templates, many do not follow the guidelines or follow them only loosely. This leads to undesirable effects, such as template duplication, heterogeneity, and schema drift. As a step towards addressing this problem, we propose a new unsupervised approach for clustering Wikipedia infoboxes. Instead of relying on manually assigned categories and template labels, we use the structured information available in infoboxes to group them and infer their entity types. Experiments using over 48,000 infoboxes indicate that our clustering approach is effective and produces high-quality clusters.
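To make the general idea concrete, the sketch below groups infoboxes by the overlap of their attribute names alone, ignoring categories and template labels. It is an illustration under assumptions, not the authors' algorithm: the abstract does not specify the similarity measure or clustering procedure, so the Jaccard similarity, the greedy single-pass grouping, the 0.5 threshold, the function names, and the toy infoboxes are all hypothetical choices.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of attribute names."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_infoboxes(infoboxes, threshold=0.5):
    """Greedy single-pass clustering (an assumed stand-in technique).

    Each infobox joins the first cluster whose representative (the
    attribute set of that cluster's first member) is at least
    `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []  # list of (representative attribute set, member names)
    for name, attrs in infoboxes.items():
        attrs = frozenset(attrs)
        for representative, members in clusters:
            if jaccard(representative, attrs) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((attrs, [name]))
    return [members for _, members in clusters]


# Hypothetical toy data: article title -> infobox attribute names.
infoboxes = {
    "Inception": {"director", "starring", "release_date", "running_time"},
    "Heat": {"director", "starring", "release_date", "budget"},
    "Salt Lake City": {"population", "area", "mayor", "timezone"},
    "Porto Alegre": {"population", "area", "mayor", "founded"},
}

clusters = cluster_infoboxes(infoboxes)
print(clusters)
```

On this toy input the two film infoboxes share three of five distinct attributes (similarity 0.6) and land in one cluster, as do the two city infoboxes. An entity type could then be inferred per cluster, for example from its most frequent attributes. A greedy pass is order-dependent, so a real system would likely need a more robust procedure.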