Using word sense discrimination on historic document collections

  • Authors:
  • Nina Tahmasebi;Kai Niklas;Thomas Theuerkauf;Thomas Risse

  • Affiliations:
  • L3S Research Center, Hannover, Germany;L3S Research Center, Hannover, Germany;L3S Research Center, Hannover, Germany;L3S Research Center, Hannover, Germany

  • Venue:
  • Proceedings of the 10th annual joint conference on Digital libraries
  • Year:
  • 2010

Abstract

Word sense discrimination is an important first step towards the automatic detection of language evolution within large, historic document collections. By comparing the word senses found over time, we can reveal and use important information that will improve the understanding and accessibility of a digital archive. Algorithms for word sense discrimination have been developed with today's language in mind and have thus been evaluated on well-selected, modern datasets. The quality of the word senses found in the discrimination step has a large impact on the detection of language evolution. Therefore, as a first step, we verify that word sense discrimination can be applied successfully to digitized historic documents and that the results correctly correspond to word senses. Because the accessibility of digitized historic collections is also influenced by the quality of the optical character recognition (OCR), as a second step we investigate the effects of OCR errors on word sense discrimination results. All evaluations in this paper are performed on The Times Archive, a collection of newspaper articles from 1785 to 1985.