Data Compression

Debra A. Lelewer and Daniel S. Hirschberg

Table of Contents

Abstract
INTRODUCTION
1. FUNDAMENTAL CONCEPTS
1.1 Definitions
1.2 Classification of Methods
1.3 A Data Compression Model
1.4 Motivation
2. SEMANTIC DEPENDENT METHODS
3. STATIC DEFINED-WORD SCHEMES
3.1 Shannon-Fano Coding
3.2 Static Huffman Coding
3.3 Universal Codes and Representations of the Integers
3.4 Arithmetic Coding
4. ADAPTIVE HUFFMAN CODING
4.1 Algorithm FGK
4.2 Algorithm V
5. OTHER ADAPTIVE METHODS
5.1 Lempel-Ziv Codes
5.2 Algorithm BSTW
6. EMPIRICAL RESULTS
7. SUSCEPTIBILITY TO ERROR
7.1 Static Codes
7.2 Adaptive Codes
8. NEW DIRECTIONS
9. SUMMARY
REFERENCES

Abstract

This paper surveys a variety of data compression methods spanning almost forty years of research, from the work of Shannon, Fano, and Huffman in the late 1940s to a technique developed in 1986. The aim of data compression is to reduce redundancy in stored or communicated data, thus increasing effective data density. Data compression has important applications in the areas of file storage and distributed systems.

Concepts from information theory, as they relate to the goals and evaluation of data compression methods, are discussed briefly. A framework for evaluation and comparison of methods is constructed and applied to the algorithms presented. Comparisons of both theoretical and empirical natures are reported and possibilities for future research are suggested.

INTRODUCTION

Data compression is often referred to as coding, where coding is a very general term encompassing any special representation of data which satisfies a given need. Information theory is defined to be the study of efficient coding and its consequences, in the form of speed of transmission and probability of error [Ingels 1971]. Data compression may be viewed as a branch of information theory in which the primary objective is to minimize the amount of data to be transmitted. The purpose of this paper is to present and analyze a variety of data compression algorithms.

A simple characterization of data compression is that it involves transforming a string of characters in some representation (such as ASCII) into a new string (of bits, for example) which contains the same information but whose length is as small as possible. Data compression has important applications in the areas of data transmission and data storage. Many data processing applications require storage of large volumes of data, and the number of such applications is constantly increasing as the use of computers extends to new disciplines. At the same time, the proliferation of computer communication networks is resulting in massive transfer of data over communication links. Compressing data to be stored or transmitted reduces storage and/or communication costs. When the amount of data to be transmitted is reduced, the effect is that of increasing the capacity of the communication channel. Similarly, compressing a file to half of its original size is equivalent to doubling the capacity of the storage medium. It may then become feasible to store the data at a higher, and thus faster, level of the storage hierarchy and reduce the load on the input/output channels of the computer system.
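As a concrete (if trivial) instance of this characterization, the sketch below repacks characters drawn from a hypothetical four-symbol alphabet {a, b, c, d}, each stored as an 8-bit ASCII byte, into fixed 2-bit codes. The alphabet, the code assignment, and the C realization are illustrative assumptions rather than a method from the survey; the techniques discussed in later sections exploit symbol probabilities rather than merely a reduced alphabet size.

    /* Illustrative sketch: characters from an assumed four-symbol
     * alphabet {a,b,c,d}, each held in an 8-bit ASCII byte, are
     * repacked as fixed 2-bit codes, shrinking the representation
     * to one quarter of its original length.                      */
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical code assignment: a=00, b=01, c=10, d=11. */
    static int code(char ch) { return ch - 'a'; }

    int main(void)
    {
        const char *input = "abcaddba";  /* 8 characters = 64 ASCII bits */
        size_t n = strlen(input);
        unsigned char out[64] = {0};
        size_t bitpos = 0;               /* next free bit in 'out' */

        for (size_t i = 0; i < n; i++) {
            int c = code(input[i]);
            /* pack the 2-bit code at the current bit offset,
             * high-order bits of each byte first              */
            out[bitpos / 8] |= (unsigned char)(c << (6 - (bitpos % 8)));
            bitpos += 2;
        }
        printf("%zu bits in, %zu bits out\n", n * 8, bitpos);
        return 0;
    }

Run on the 8-character string above, the sketch emits 16 bits where the ASCII representation used 64.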

Many of the methods to be discussed in this paper are implemented in production systems. The UNIX utilities compact and compress are based on methods to be discussed in Sections 4 and 5 respectively [UNIX 1984]. Popular file archival systems such as ARC and PKARC employ techniques presented in Sections 3 and 5 [ARC 1986; PKARC 1987]. The savings achieved by data compression can be dramatic; reductions as high as 80% are not uncommon [Reghbati 1981]. Typical values of compression provided by compact are: text (38%), Pascal source (43%), C source (36%) and binary (19%). Compress generally achieves better compression (50-60% for text such as source code and English) and takes less time to compute [UNIX 1984]. Arithmetic coding (Section 3.4) has been reported to reduce a file to anywhere from 12.1 to 73.5% of its original size [Witten et al. 1987]. Cormack reports that data compression programs based on Huffman coding (Section 3.2) reduced the size of a large student-record database by 42.1% when only some of the information was compressed. As a consequence of this size reduction, the number of disk operations required to load the database was reduced by 32.7% [Cormack 1985]. Data compression routines developed with specific applications in mind have achieved compression factors as high as 98% [Severance 1983].

While coding for purposes of data security (cryptography) and codes which guarantee a certain level of data integrity (error detection/correction) are topics worthy of attention, these do not fall under the umbrella of data compression. With the exception of a brief discussion of the susceptibility to error of the methods surveyed (Section 7), a discrete noiseless channel is assumed. That is, we assume a system in which a sequence of symbols chosen from a finite alphabet can be transmitted from one point to another without the possibility of error. Of course, the coding schemes described here may be combined with data security or error correcting codes.

Much of the available literature on data compression approaches the topic from the point of view of data transmission. As noted earlier, data compression is of value in data storage as well. Although this discussion will be framed in the terminology of data transmission, compression and decompression of data files for storage is essentially the same task as sending and receiving compressed data over a communication channel. The focus of this paper is on algorithms for data compression; it does not deal with hardware aspects of data transmission. The reader is referred to Cappellini for a discussion of techniques with natural hardware implementation [Cappellini 1985].

Background concepts in the form of terminology and a model for the study of data compression are provided in Section 1. Applications of data compression are also discussed in Section 1, to provide motivation for the material which follows.

While the primary focus of this survey is data compression methods of general utility, Section 2 includes examples from the literature in which ingenuity applied to domain-specific problems has yielded interesting coding techniques. These techniques are referred to as semantic dependent since they are designed to exploit the context and semantics of the data to achieve redundancy reduction. Semantic dependent techniques include the use of quadtrees, run length encoding, or difference mapping for storage and transmission of image data [Gonzalez and Wintz 1977; Samet 1984].
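Of the techniques just mentioned, run length encoding is simple enough to sketch here. The fragment below is an illustrative sketch, not code from the cited sources: it replaces each run of identical bytes with a (count, value) pair, which pays off on data with long runs, such as the rows of a bilevel image, and expands data without them.

    /* Illustrative run length encoder (not from the cited sources):
     * each run of up to 255 identical bytes is replaced by a
     * (count, value) pair. Effective for data with long runs, such
     * as rows of a bilevel image; run-free data would expand.      */
    #include <stdio.h>

    /* Encode n bytes from 'in' into 'out'; returns encoded length. */
    size_t rle_encode(const unsigned char *in, size_t n, unsigned char *out)
    {
        size_t i = 0, o = 0;
        while (i < n) {
            unsigned char value = in[i];
            size_t run = 1;
            while (i + run < n && in[i + run] == value && run < 255)
                run++;
            out[o++] = (unsigned char)run;   /* run length */
            out[o++] = value;                /* repeated byte */
            i += run;
        }
        return o;
    }

    int main(void)
    {
        unsigned char row[] = { 0,0,0,0,0,0,1,1,0,0,0,0 }; /* one image row */
        unsigned char enc[2 * sizeof row];
        size_t m = rle_encode(row, sizeof row, enc);
        printf("12 bytes encoded as %zu bytes\n", m);      /* prints 6 */
        return 0;
    }

The 12-byte row collapses to three (count, value) pairs, i.e. 6 bytes; whether such a scheme helps at all depends entirely on the semantics of the data, which is precisely what distinguishes the methods of Section 2 from the general-purpose ones that follow.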

General-purpose techniques, which assume no knowledge of the information content of the data, are described in Sections 3-5. These descriptions are sufficiently detailed to provide an understanding of the techniques. The reader will need to consult the references for implementation details. In most cases, only worst-case analyses of the methods are feasible. To provide a more realistic picture of their effectiveness, empirical data is presented in Section 6. The susceptibility to error of the algorithms surveyed is discussed in Section 7 and possible directions for future research are considered in Section 8.
