Cross-lingual entity matching and infobox alignment in Wikipedia

Authors:
Daniel Rinser; Dustin Lange; Felix Naumann
Affiliations:
Hasso Plattner Institute, Potsdam, Germany; Hasso Plattner Institute, Potsdam, Germany; Hasso Plattner Institute, Potsdam, Germany
Venue:
Information Systems
Year:
2013

Abstract

Wikipedia has grown into a huge, multilingual source of encyclopedic knowledge. Apart from textual content, a large and ever-increasing number of articles feature so-called infoboxes, which provide factual information about the articles' subjects. Because the different language versions evolve independently, they provide different information on the same topics. Correspondences between infobox attributes in different language editions can be leveraged for several use cases, such as automatically detecting and resolving inconsistencies in infobox data across language versions, or automatically augmenting infoboxes in one language with data from other language versions. We present an instance-based schema matching technique that exploits information overlap in infoboxes across different language editions. As a prerequisite, we present a graph-based approach that identifies articles in different languages representing the same real-world entity by using (and correcting) the interlanguage links in Wikipedia. To account for the untyped nature of infobox schemas, we present a robust similarity measure that can reliably quantify the similarity of strings with mixed types of data. A qualitative evaluation based on manually labeled attribute correspondences between infoboxes in four of the largest Wikipedia editions demonstrates the effectiveness of the proposed approach.
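
The two steps named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes that entity matching can be approximated by taking connected components of the interlanguage-link graph and flagging components that contain more than one article per language, and it substitutes Python's `difflib.SequenceMatcher` for the paper's mixed-type similarity measure. The function names (`entity_clusters`, `is_coherent`, `attribute_scores`) and the sample data are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import product


def entity_clusters(links):
    """Group (language, title) articles connected by interlanguage links.

    Assumption: connected components (via union-find) approximate entities.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in links:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def is_coherent(cluster):
    """True if the cluster holds at most one article per language."""
    languages = [lang for lang, _ in cluster]
    return len(languages) == len(set(languages))


def attribute_scores(infobox_pairs):
    """Average value similarity of every attribute pair over matched entities.

    SequenceMatcher is a stand-in, not the similarity measure of the paper.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for box_a, box_b in infobox_pairs:
        for (attr_a, val_a), (attr_b, val_b) in product(box_a.items(), box_b.items()):
            sim = SequenceMatcher(None, val_a.lower(), val_b.lower()).ratio()
            totals[(attr_a, attr_b)] += sim
            counts[(attr_a, attr_b)] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Hypothetical sample data: one clean entity and one with a conflicting link.
links = [
    (("en", "Berlin"), ("de", "Berlin")),
    (("de", "Berlin"), ("fr", "Berlin")),
    (("en", "Paris"), ("fr", "Paris")),
    (("en", "Paris"), ("fr", "Paris (ville)")),
]
clusters = entity_clusters(links)
coherent = [c for c in clusters if is_coherent(c)]

scores = attribute_scores([
    ({"population": "3,500,000", "country": "Germany"},
     {"einwohner": "3.500.000", "staat": "Deutschland"}),
])
best_for_population = max(
    (pair for pair in scores if pair[0] == "population"), key=scores.get
)
```

In this sketch, a component with two articles in the same language is only flagged, not repaired; the abstract states that the approach also corrects interlanguage links, which would need an additional conflict-resolution step. The attribute scores are plain averages over entity pairs, so a real matcher would still have to select a final set of correspondences from them.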