Lexical Attraction for Text Compression

  • Authors:
  • Joscha Bach; Ian H. Witten

  • Venue:
  • DCC '99 Proceedings of the Conference on Data Compression
  • Year:
  • 1999

Abstract

The best methods of text compression work by conditioning each symbol's probability on its predecessors. Prior symbols establish a context that governs the probability distribution for the next one, and the actual next symbol is encoded with respect to this distribution. But the best predictors for words in natural language are not necessarily their immediate predecessors. Verbs may depend on nouns, pronouns on names, closing brackets on opening ones, question marks on "wh"-words.

To establish a more appropriate dependency structure, the lexical attraction of a pair of words is defined as the likelihood that they will appear (in that order) within a sentence, regardless of how far apart they are. This is estimated by counting the co-occurrences of words in the sentences of a large corpus. Then, for each sentence, an undirected (planar, acyclic) graph is found that maximizes the lexical attraction between linked items, effectively reorganizing the text in the form of a low-entropy model.
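
The abstract does not spell out the estimation or linking procedure, so the following is only a minimal sketch of one plausible reading, not the paper's implementation: ordered word-pair co-occurrences are counted over a corpus, each pair is scored with a pointwise-mutual-information-style attraction (an assumed measure; the paper's exact score may differ), and each sentence is then linked by a maximum-weight acyclic graph built with a Kruskal-style maximum spanning tree. The planarity constraint mentioned in the abstract is not enforced in this sketch.

```python
# Minimal sketch (assumptions noted above): co-occurrence counting,
# a PMI-style attraction score, and an acyclic maximum-weight linkage.
import math
from collections import Counter
from itertools import combinations

def count_cooccurrences(sentences):
    """Count ordered word pairs co-occurring within a sentence, plus per-sentence unigrams."""
    pair_counts, word_counts, n_sentences = Counter(), Counter(), 0
    for sent in sentences:
        n_sentences += 1
        word_counts.update(set(sent))
        for i, j in combinations(range(len(sent)), 2):
            pair_counts[(sent[i], sent[j])] += 1   # ordered: sent[i] precedes sent[j]
    return pair_counts, word_counts, n_sentences

def attraction(w1, w2, pair_counts, word_counts, n):
    """PMI-style score for the ordered pair (w1, w2); an illustrative assumption."""
    joint = pair_counts.get((w1, w2), 0)
    if joint == 0:
        return float("-inf")
    p_joint = joint / n
    p1, p2 = word_counts[w1] / n, word_counts[w2] / n
    return math.log2(p_joint / (p1 * p2))

def link_sentence(sent, pair_counts, word_counts, n):
    """Greedily pick links with maximum total attraction while keeping the
    graph acyclic (Kruskal-style maximum spanning tree; planarity not enforced)."""
    edges = sorted(
        ((attraction(sent[i], sent[j], pair_counts, word_counts, n), i, j)
         for i, j in combinations(range(len(sent)), 2)),
        reverse=True)
    parent = list(range(len(sent)))          # union-find forest for cycle detection
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    links = []
    for score, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                          # adding this link keeps the graph acyclic
            parent[ri] = rj
            links.append((i, j, score))
    return links

# Usage: train counts on a toy corpus, then link a new sentence.
corpus = [["what", "did", "she", "say", "?"],
          ["what", "did", "he", "ask", "?"],
          ["she", "opened", "the", "box"]]
pc, wc, n = count_cooccurrences(corpus)
print(link_sentence(["what", "did", "she", "ask", "?"], pc, wc, n))
```

In this sketch the spanning-tree step plays the role of the abstract's maximum-attraction linkage; the compression step itself (encoding words with respect to their linked predecessors) is not shown.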