Extracting Geospatial Entities from Wikipedia

  • Authors:
  • Jeremy Witmer; Jugal Kalita


  • Venue:
  • ICSC '09 Proceedings of the 2009 IEEE International Conference on Semantic Computing
  • Year:
  • 2009


Abstract

This paper addresses the challenge of extracting geospatial data from the article text of the English Wikipedia. In the first phase of our work, we create a training corpus and select a set of word-based features to train a Support Vector Machine (SVM) for the task of geospatial named entity recognition. For testing, we target a corpus of Wikipedia articles about battles and wars, as these have a high incidence of geospatial content. The SVM recognizes place names in the corpus with very high recall, close to 100%, and acceptable precision. The set of geospatial NEs is then fed into a geocoding and resolution process, whose goal is to determine the correct coordinates for each place name. Because many place names are ambiguous and do not immediately geocode to a single location, we present a data structure and algorithm that resolve this ambiguity using sentence and article context, so that the correct coordinates can be selected. We achieve an f-measure of 82% and create a set of geospatial entities for each article, combining the place names, spatial locations, and an assumed point geometry. These entities can enable geospatial search on, and geovisualization of, Wikipedia.
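The abstract does not spell out the resolution algorithm, but the idea of choosing among ambiguous geocoding candidates using surrounding context can be sketched. The snippet below is a minimal illustration, not the paper's method: it assumes a simple centroid heuristic, picking the candidate coordinate closest to the centroid of unambiguous place names from the same article. The function names and the heuristic itself are illustrative assumptions.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def resolve(candidates, context_points):
    """Illustrative heuristic (not the paper's algorithm): pick the
    candidate coordinate closest to the centroid of coordinates of
    unambiguous place names found in the article context."""
    lat = sum(p[0] for p in context_points) / len(context_points)
    lon = sum(p[1] for p in context_points) / len(context_points)
    return min(candidates, key=lambda c: haversine_km(c, (lat, lon)))

# "Paris" is ambiguous: Paris, France vs. Paris, Texas.
# Unambiguous context mentions London and Berlin, so the
# European candidate wins.
paris_candidates = [(48.85, 2.35), (33.66, -95.55)]
context = [(51.51, -0.13), (52.52, 13.40)]
print(resolve(paris_candidates, context))  # → (48.85, 2.35)
```

A real resolver would also weight candidates by population or gazetteer prominence and by proximity within the sentence, but the centroid step captures the core intuition of context-based disambiguation described in the abstract.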