Knowledge Extraction from Bilingual Corpora

  • Authors: Harold L. Somers
  • Venue: Information Extraction: Towards Scalable, Adaptable Systems
  • Year: 1999

Abstract

The use of corpora has become an important issue in information extraction (IE). In this chapter we consider a specific type of corpus, the bilingual parallel corpus, and ways of automatically extracting information from such corpora. This information, "linguistic metaknowledge", is essential for techniques used in IE, such as tokenization, POS-tagging, and morphological analysis. To extract information from multilingual texts, we must rely on these linguistic resources being available in several languages. The chapter discusses locating and storing parallel texts, alignment at various levels (sentence, word, phrase), and the extraction of bilingual vocabulary and terminology.
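
As a rough illustration of the last topic, extracting bilingual vocabulary from sentence-aligned text, the sketch below scores word pairs by how often they co-occur in aligned sentence pairs. It uses the Dice coefficient, a common association measure chosen here as an assumption; the abstract does not say which method the chapter uses. The function names (`dice_lexicon`, `best_translations`), the `min_count` threshold, and the toy English-French corpus are all hypothetical and exist only for this example.

```python
from collections import Counter
from itertools import product


def dice_lexicon(aligned_pairs, min_count=2):
    """Score (source, target) word pairs with the Dice coefficient.

    aligned_pairs: iterable of (source_sentence, target_sentence) strings
    that are assumed to be translations of each other.
    min_count: hypothetical threshold; pairs that co-occur in fewer
    sentence pairs than this are dropped as unreliable.
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in aligned_pairs:
        # Whitespace tokenization; each word type counts once per sentence pair.
        src_words = set(src_sentence.lower().split())
        tgt_words = set(tgt_sentence.lower().split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        pair_freq.update(product(src_words, tgt_words))

    scores = {}
    for (src, tgt), joint in pair_freq.items():
        if joint >= min_count:
            # Dice = 2 * co-occurrences / (source count + target count)
            scores[(src, tgt)] = 2 * joint / (src_freq[src] + tgt_freq[tgt])
    return scores


def best_translations(scores):
    """Keep the highest-scoring target candidate for each source word."""
    best = {}
    for (src, tgt), score in scores.items():
        if src not in best or score > best[src][1]:
            best[src] = (tgt, score)
    return best


# Toy sentence-aligned corpus (invented for illustration only).
toy_corpus = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
    ("a car", "une voiture"),
]

scores = dice_lexicon(toy_corpus)
lexicon = best_translations(scores)
for src, (tgt, score) in sorted(lexicon.items()):
    print(f"{src} -> {tgt} (Dice = {score:.2f})")
```

On this toy corpus each source word pairs with exactly one target candidate that survives the threshold, for example `house` with `maison`. Real parallel corpora would need proper tokenization, frequency filtering, and usually a sentence alignment step beforehand, none of which this sketch attempts.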