Universal compression of memoryless sources over unknown alphabets

Authors:
A. Orlitsky; N. P. Santhanam; Junan Zhang
Affiliations:
Dept. of Electr. & Comput. Eng., Univ. of California, La Jolla, CA, USA;-;-
Venue:
IEEE Transactions on Information Theory
Year:
2006

Abstract

It has long been known that the compression redundancy of independent and identically distributed (i.i.d.) strings increases to infinity as the alphabet size grows. It is also apparent that any string can be described by separately conveying its symbols and its pattern, that is, the order in which the symbols appear. Concentrating on the latter, we show that the patterns of i.i.d. strings over all alphabets, including infinite and even unknown ones, can be compressed with diminishing redundancy, both in block and sequentially, and that the compression can be performed in linear time. To establish these results, we show that the number of patterns is the Bell number, that the number of patterns with a given number of symbols is the Stirling number of the second kind, and that the redundancy of patterns can be bounded using results of Hardy and Ramanujan on the number of integer partitions. The results also imply an asymptotically optimal solution for the Good-Turing probability-estimation problem.
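As an illustration of the counting claims in the abstract, the following minimal Python sketch computes the pattern of a string and checks, by brute force on small lengths, that the number of patterns matches the Bell number and that the number of patterns with k distinct symbols matches the Stirling number of the second kind. The function names (pattern, stirling2, bell, enumerate_patterns) are hypothetical and chosen for this example; this is not the paper's compression algorithm, only a check of the combinatorial identities it states.

```python
from itertools import product


def pattern(s):
    """Replace each symbol by the 1-based order of its first appearance."""
    index = {}
    out = []
    for ch in s:
        if ch not in index:
            index[ch] = len(index) + 1  # next unused index
        out.append(index[ch])
    return tuple(out)


def stirling2(n, k):
    """Stirling number of the second kind, via S(n,k) = k*S(n-1,k) + S(n-1,k-1)."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def bell(n):
    """Bell number: the sum of S(n, k) over k = 0..n."""
    return sum(stirling2(n, k) for k in range(n + 1))


def enumerate_patterns(n):
    """All distinct patterns of length-n strings (brute force, small n only).

    An alphabet of n symbols suffices, since a length-n string
    contains at most n distinct symbols.
    """
    return {pattern(w) for w in product(range(n), repeat=n)}


if __name__ == "__main__":
    print(pattern("abracadabra"))
    n = 4
    pats = enumerate_patterns(n)
    print(len(pats), bell(n))
    for k in range(1, n + 1):
        # max(p) equals the number of distinct symbols in the pattern
        count = sum(1 for p in pats if max(p) == k)
        print(k, count, stirling2(n, k))
```

The brute-force enumeration grows as n^n and is meant only for very small n; the recursive stirling2 is likewise unmemoized for brevity.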