Word Sense Disambiguation of Czech Texts

  • Authors:
  • Ondrej Cikhart; Jan Hajic

  • Venue:
  • TSD '99 Proceedings of the Second International Workshop on Text, Speech and Dialogue
  • Year:
  • 1999

Abstract

This contribution reports on a project of BYLL Software Ltd. that uses human-aided word sense disambiguation (WSD) to annotate ASPI, a full-text database of the Czech legal system. We used about 3 million words of annotated texts from the legal system of the Czech Republic dating from the 1960s onward. The annotated legal corpus exhibits a certain textual regularity, but at the same time it covers a wide range of subjects. The goal was to save as much human intervention during text indexing as possible, measured by the number of queries posed to the human annotator, while retaining a truly minimal error rate (∼0.5%) in the automatically disambiguated cases. A combination of Naive Bayes, Decision Lists, and a minimal number of manually written rules was used. The statistical methods proved appropriate for our purpose. The results show that we saved 80% of the queries to the human annotator, which proved to be enough to warrant the inclusion of the software in a production system.
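The trade-off described in the abstract, answering automatically only when the classifier is confident and otherwise querying the annotator, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the features, the smoothing, or the confidence criterion, so the bag-of-words context, the add-one smoothing, the `threshold` parameter, the toy training data, and all names below are assumptions made for illustration. Only the Naive Bayes component is shown; the Decision Lists and manual rules are omitted.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesWSD:
    """Toy Naive Bayes sense classifier over bag-of-words contexts (illustrative only)."""

    def __init__(self):
        self.sense_counts = Counter()            # number of training examples per sense
        self.word_counts = defaultdict(Counter)  # context-word counts per sense
        self.vocab = set()

    def train(self, examples):
        for context, sense in examples:
            self.sense_counts[sense] += 1
            for word in context:
                self.word_counts[sense][word] += 1
                self.vocab.add(word)

    def posteriors(self, context):
        """Return P(sense | context), using add-one smoothing (an assumption)."""
        total = sum(self.sense_counts.values())
        vocab_size = len(self.vocab)
        log_scores = {}
        for sense, n in self.sense_counts.items():
            denom = sum(self.word_counts[sense].values()) + vocab_size
            score = math.log(n / total)
            for word in context:
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.word_counts[sense][word] + 1) / denom)
            log_scores[sense] = score
        # Normalise in log space to avoid underflow on long contexts.
        top = max(log_scores.values())
        exp = {s: math.exp(x - top) for s, x in log_scores.items()}
        z = sum(exp.values())
        return {s: e / z for s, e in exp.items()}


def disambiguate(model, context, threshold=0.95):
    """Return the best sense, or None to signal 'ask the human annotator'."""
    post = model.posteriors(context)
    sense, p = max(post.items(), key=lambda kv: kv[1])
    return sense if p >= threshold else None


# Hypothetical toy data for one ambiguous word with two senses;
# context tokens are unaccented Czech words chosen for illustration.
model = NaiveBayesWSD()
model.train([
    (["soud", "spravni", "ucastnik"], "proceedings"),
    (["soud", "odvolani"], "proceedings"),
    (["vozidlo", "motorove", "ridic"], "driving"),
    (["vozidlo", "silnice"], "driving"),
])

confident = disambiguate(model, ["soud", "odvolani"], threshold=0.8)
uncertain = disambiguate(model, ["soud", "vozidlo"], threshold=0.8)
```

In this sketch, raising `threshold` sends more cases to the annotator and lowers the error rate among the automatic decisions, while lowering it saves more queries at the cost of more errors. This is the same balance the abstract reports as 80% of queries saved at an error rate of about 0.5%.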