Lexical Attraction for Text Compression

  • Authors:
  • Joscha Bach; Ian H. Witten

  • Venue:
  • DCC '99 Proceedings of the Conference on Data Compression
  • Year:
  • 1999

Abstract

The best methods of text compression work by conditioning each symbol's probability on its predecessors. Prior symbols establish a context that governs the probability distribution for the next one, and the actual next symbol is encoded with respect to this distribution. But the best predictors for words in natural language are not necessarily their immediate predecessors. Verbs may depend on nouns, pronouns on names, closing brackets on opening ones, question marks on "wh"-words.

To establish a more appropriate dependency structure, the lexical attraction of a pair of words is defined as the likelihood that they will appear (in that order) within a sentence, regardless of how far apart they are. This is estimated by counting the co-occurrences of words in the sentences of a large corpus. Then, for each sentence, an undirected (planar, acyclic) graph is found that maximizes the lexical attraction between linked items, effectively reorganizing the text in the form of a low-entropy model.
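
The abstract does not spell out the estimation or linking procedure, so the following is only a minimal sketch of one plausible reading, not the paper's implementation: ordered word-pair co-occurrences are counted over a corpus, each pair is scored with a pointwise-mutual-information-style attraction (an assumed measure; the paper's exact score may differ), and each sentence is then linked by a maximum-weight acyclic graph built with a Kruskal-style maximum spanning tree. The planarity constraint mentioned in the abstract is not enforced in this sketch.

```python
# Minimal sketch (assumptions noted above): co-occurrence counting,
# a PMI-style attraction score, and an acyclic maximum-weight linkage.
import math
from collections import Counter
from itertools import combinations

def count_cooccurrences(sentences):
    """Count ordered word pairs co-occurring within a sentence, plus per-sentence unigrams."""
    pair_counts, word_counts, n_sentences = Counter(), Counter(), 0
    for sent in sentences:
        n_sentences += 1
        word_counts.update(set(sent))
        for i, j in combinations(range(len(sent)), 2):
            pair_counts[(sent[i], sent[j])] += 1   # ordered: sent[i] precedes sent[j]
    return pair_counts, word_counts, n_sentences

def attraction(w1, w2, pair_counts, word_counts, n):
    """PMI-style score for the ordered pair (w1, w2); an illustrative assumption."""
    joint = pair_counts.get((w1, w2), 0)
    if joint == 0:
        return float("-inf")
    p_joint = joint / n
    p1, p2 = word_counts[w1] / n, word_counts[w2] / n
    return math.log2(p_joint / (p1 * p2))

def link_sentence(sent, pair_counts, word_counts, n):
    """Greedily pick links with maximum total attraction while keeping the
    graph acyclic (Kruskal-style maximum spanning tree; planarity not enforced)."""
    edges = sorted(
        ((attraction(sent[i], sent[j], pair_counts, word_counts, n), i, j)
         for i, j in combinations(range(len(sent)), 2)),
        reverse=True)
    parent = list(range(len(sent)))          # union-find forest for cycle detection
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    links = []
    for score, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                          # adding this link keeps the graph acyclic
            parent[ri] = rj
            links.append((i, j, score))
    return links

# Usage: train counts on a toy corpus, then link a new sentence.
corpus = [["what", "did", "she", "say", "?"],
          ["what", "did", "he", "ask", "?"],
          ["she", "opened", "the", "box"]]
pc, wc, n = count_cooccurrences(corpus)
print(link_sentence(["what", "did", "she", "ask", "?"], pc, wc, n))
```

In this sketch the spanning-tree step plays the role of the abstract's maximum-attraction linkage; the compression step itself (encoding words with respect to their linked predecessors) is not shown.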