An open-source toolkit for mining Wikipedia

  • Authors:
  • David Milne; Ian H. Witten

  • Affiliations:
  • Computer Science Department, The University of Waikato, Private Bag 3105, Hamilton, New Zealand (both authors)

  • Venue:
  • Artificial Intelligence
  • Year:
  • 2013

Abstract

The online encyclopedia Wikipedia is a vast, constantly evolving tapestry of interlinked articles. For developers and researchers it represents a giant multilingual database of concepts and semantic relations, a potential resource for natural language processing and many other research areas. This paper introduces the Wikipedia Miner toolkit, an open-source software system that allows researchers and developers to integrate Wikipedia's rich semantics into their own applications. The toolkit creates databases that contain summarized versions of Wikipedia's content and structure, and includes a Java API to provide access to them. Wikipedia's articles, categories and redirects are represented as classes, and can be efficiently searched, browsed, and iterated over. Advanced features include parallelized processing of Wikipedia dumps, machine-learned semantic relatedness measures and annotation features, and XML-based web services. Wikipedia Miner is intended to be a platform for sharing data mining techniques.