Clustering Wikipedia infoboxes to discover their types

Authors:
Thanh Hoang Nguyen;Huong Dieu Nguyen;Viviane Moreira;Juliana Freire
Affiliations:
University of Utah, Salt Lake City, UT, USA;University of Utah, Salt Lake City, UT, USA;UFRGS-Brazil, Porto Alegre, Brazil;New York University - Poly, New York, NY, USA
Venue:
Proceedings of the 21st ACM international conference on Information and knowledge management
Year:
2012

Abstract

Wikipedia has emerged as an important source of structured information on the Web. While its success can be attributed in part to the simplicity of adding and modifying content, that same simplicity creates challenges for using, querying, and integrating the information. Although authors are encouraged to select appropriate categories and provide infoboxes that follow pre-defined templates, many do not follow the guidelines or follow them only loosely. This leads to undesirable effects such as template duplication, heterogeneity, and schema drift. As a step towards addressing this problem, we propose a new unsupervised approach for clustering Wikipedia infoboxes. Instead of relying on manually assigned categories and template labels, we use the structured information available in infoboxes to group them and infer their entity types. Experiments using over 48,000 infoboxes indicate that our clustering approach is effective and produces high-quality clusters.
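
The abstract does not spell out the clustering algorithm, so the following is only an illustrative sketch of the general idea of grouping infoboxes by their structured content rather than by their labels. It is not the authors' method. It assumes each infobox is reduced to the set of its attribute names, uses Jaccard similarity as a stand-in similarity measure, and applies a simple greedy single-pass grouping. The function names, the 0.5 threshold, and the toy data are all hypothetical.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two sets of attribute names."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_infoboxes(
    infoboxes: Dict[str, Set[str]], threshold: float = 0.5
) -> List[List[str]]:
    """Greedy single-pass grouping (an assumed stand-in, not the paper's algorithm).

    Each infobox joins the first cluster whose accumulated attribute set is
    at least `threshold` similar; otherwise it starts a new cluster.
    """
    clusters: List[List[str]] = []
    schemas: List[Set[str]] = []  # union of attribute names seen in each cluster
    for name, attrs in infoboxes.items():
        for i, schema in enumerate(schemas):
            if jaccard(attrs, schema) >= threshold:
                clusters[i].append(name)
                schema |= attrs  # grow the cluster's schema in place
                break
        else:
            # No existing cluster was similar enough: open a new one.
            clusters.append([name])
            schemas.append(set(attrs))
    return clusters


# Hypothetical toy infoboxes, identified only by their attribute names.
infoboxes = {
    "Film A": {"director", "starring", "release_date", "running_time"},
    "Film B": {"director", "starring", "release_date", "budget"},
    "City A": {"population", "area", "mayor", "country"},
    "City B": {"population", "area", "country", "timezone"},
}
clusters = cluster_infoboxes(infoboxes)
```

On this toy input the two film infoboxes and the two city infoboxes end up in separate groups without any template label being consulted. The attribute sets shared within each group could then hint at an entity type. A greedy pass is order-dependent, so a real system would likely use a more robust clustering procedure.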