Abstract
Over the past 25 years, the practical requirement for efficient data compression algorithms has generated a large volume of research covering the whole spectrum from practically implementable algorithms to deep theoretical results. One prominent example is the Lempel-Ziv algorithm for lossless data compression: Not only is it implemented on most computers used today, but also, attempts to analyze its performance have provided new problems in probability, information theory and ergodic theory, whose solutions reveal a series of interesting results about the entropy and the recurrence structure of stationary processes.
The main problems considered in this thesis are those of determining the asymptotic behavior of waiting times and recurrence times in stationary processes. These questions are motivated primarily by their important applications in data compression and the analysis of string matching algorithms in DNA sequence analysis. In particular, solving the waiting times problem also allowed us to solve a long-standing open problem in data compression: That of finding a practical extension of the Lempel-Ziv coding algorithm for lossy compression.
This thesis is divided into three parts. In the first part we generalize one of the central theoretical results in source coding theory: We prove a natural generalization of the celebrated Shannon-McMillan-Breiman theorem (as well as its subsequent refinements by Ibragimov and by Philipp and Stout) for real-valued processes and for the case when distortion is allowed. These results are inspired by, and provide the key technical ingredient in, our asymptotic analysis of recurrence and waiting times, in the second part. The main probabilistic tools used in establishing them are uniform almost-sure approximation, powerful techniques from large deviations, and classical second-moment blocking arguments.
In the second part we consider the problem of waiting times between stationary processes. We show that waiting times grow exponentially with probability one and, that their rate is given by the solution to an explicit variational problem in terms of the entropies of the underlying processes. Moreover, we show that, properly scaled, the deviations of the waiting times from their limiting exponent are asymptotically Gaussian (with a limiting variance explicitly identified), and we prove finer theorems (e.g., a law of the iterated logarithm and an almost sure invariance principle) that provide the exact rate of convergence in the above limit theorems. Corresponding results are proved for recurrence times, and dual results are stated and proved for certain longest-match lengths between stationary processes.
Finally, in the third part, we use the insight gained by the waiting times results to find a practical extension of the Lempel-Ziv scheme for the case of lossy data compression. We propose a new lossy version of the so-called Fixed-Database Lempel-Ziv coding algorithm, which is of complexity ``comparable'' to that of the corresponding lossless scheme, and we prove that its compression performance is (asymptotically) optimal.