Data Compression

1. FUNDAMENTAL CONCEPTS


This section provides a brief introduction to information theory and presents the definitions and assumptions necessary for a comprehensive discussion and evaluation of data compression methods. The following string of characters is used to illustrate the concepts defined: EXAMPLE = aa bbb cccc ddddd eeeeee fffffffgggggggg.

1.1 Definitions

A code is a mapping of source messages (words from the source alphabet alpha) into codewords (words of the code alphabet beta). The source messages are the basic units into which the string to be represented is partitioned. These basic units may be single symbols from the source alphabet, or they may be strings of symbols. For string EXAMPLE, alpha = { a, b, c, d, e, f, g, space }. For purposes of explanation, beta will be taken to be { 0, 1 }. Codes can be categorized as block-block, block-variable, variable-block, or variable-variable, where block-block indicates that both the source messages and the codewords are of fixed length, and variable-variable indicates that variable-length source messages are mapped to variable-length codewords. A block-block code for EXAMPLE is shown in Figure 1.1 and a variable-variable code is given in Figure 1.2. If the string EXAMPLE were coded using the Figure 1.1 code, the length of the coded message would be 120 bits; using the code of Figure 1.2 it would be 30 bits.
source message   codeword             source message   codeword

     a             000                    aa             0
     b             001                    bbb            1
     c             010                    cccc           10
     d             011                    ddddd          11
     e             100                    eeeeee         100
     f             101                    fffffff        101
     g             110                    gggggggg       110
   space           111                    space          111

Figure 1.1: A block-block code     Figure 1.2: A variable-variable code.
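
As a quick check of these figures, the following minimal Python sketch encodes EXAMPLE under both codes and reports coded lengths of 120 and 30 bits; the dictionaries simply transcribe Figures 1.1 and 1.2, and the regular expression reflects the parse of the ensemble into letter runs and spaces.

    import re

    # EXAMPLE is the 40-character string used throughout this section
    # (note that no space separates the f's from the g's).
    EXAMPLE = "aa bbb cccc ddddd eeeeee fffffffgggggggg"

    # Transcription of Figure 1.1 (block-block) and Figure 1.2 (variable-variable).
    block_block = {'a': '000', 'b': '001', 'c': '010', 'd': '011',
                   'e': '100', 'f': '101', 'g': '110', ' ': '111'}
    variable_variable = {'aa': '0', 'bbb': '1', 'cccc': '10', 'ddddd': '11',
                         'eeeeee': '100', 'fffffff': '101',
                         'gggggggg': '110', ' ': '111'}

    # Block-block: each of the 40 characters maps to a 3-bit codeword.
    block_length = sum(len(block_block[ch]) for ch in EXAMPLE)

    # Variable-variable: parse the ensemble into its source messages
    # (runs of a single letter, or a space) and sum the codeword lengths.
    messages = re.findall(r' |a+|b+|c+|d+|e+|f+|g+', EXAMPLE)
    variable_length = sum(len(variable_variable[m]) for m in messages)

    print(block_length, variable_length)   # 120 30
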
The oldest and most widely used codes, ASCII and EBCDIC, are examples of block-block codes, mapping an alphabet of 64 (or 256) single characters onto 6-bit (or 8-bit) codewords. These are not discussed, as they do not provide compression. The codes featured in this survey are of the block-variable, variable-variable, and variable-block types.

When source messages of variable length are allowed, the question arises of how a message ensemble (sequence of messages) is parsed into individual messages. Many of the algorithms described here are defined-word schemes. That is, the set of source messages is determined prior to the invocation of the coding scheme. For example, in text file processing each character may constitute a message, or messages may be defined to consist of alphanumeric and non-alphanumeric strings. In Pascal source code, each token may represent a message. All codes involving fixed-length source messages are, by default, defined-word codes. In free-parse methods, the coding algorithm itself parses the ensemble into variable-length sequences of symbols. Most of the known data compression methods are defined-word schemes; the free-parse model differs in a fundamental way from the classical coding paradigm.

A code is distinct if each codeword is distinguishable from every other (i.e., the mapping from source messages to codewords is one-to-one). A distinct code is uniquely decodable if every codeword is identifiable when immersed in a sequence of codewords. Clearly, each of these features is desirable. The codes of Figure 1.1 and Figure 1.2 are both distinct, but the code of Figure 1.2 is not uniquely decodable. For example, the coded message 11 could be decoded as either ddddd or bbbbbb. A uniquely decodable code is a prefix code (or prefix-free code) if it has the prefix property, which requires that no codeword is a proper prefix of any other codeword. All uniquely decodable block-block and variable-block codes are prefix codes. The code with codewords { 1, 100000, 00 } is an example of a code which is uniquely decodable but which does not have the prefix property. Prefix codes are instantaneously decodable; that is, they have the desirable property that the coded message can be parsed into codewords without the need for lookahead. In order to decode a message encoded using the codeword set { 1, 100000, 00 }, lookahead is required. For example, the first codeword of the message 1000000001 is 1, but this cannot be determined until the last (tenth) symbol of the message is read (if the string of zeros had been of odd length, then the first codeword would have been 100000).
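
To make instantaneous decodability concrete, the following sketch (the function name decode_prefix is illustrative) decodes a bit string under the prefix code of Figure 1.1 by accumulating bits and emitting a source message as soon as the buffer matches a codeword; the prefix property guarantees that no lookahead is ever needed. No such loop suffices for the code { 1, 100000, 00 }, since after reading a 1 the decoder must scan ahead to learn whether the codeword is 1 or 100000.

    def decode_prefix(bits, code):
        """Instantaneously decode a bit string under a prefix code.

        Because no codeword is a proper prefix of another, the buffer
        matches at most one codeword and can be emitted immediately."""
        decode_map = {cw: msg for msg, cw in code.items()}
        out, buffer = [], ""
        for bit in bits:
            buffer += bit
            if buffer in decode_map:
                out.append(decode_map[buffer])
                buffer = ""
        if buffer:
            raise ValueError("input ends in the middle of a codeword")
        return out

    # The block-block code of Figure 1.1 is a prefix code.
    figure_1_1 = {'a': '000', 'b': '001', 'c': '010', 'd': '011',
                  'e': '100', 'f': '101', 'g': '110', ' ': '111'}
    print(decode_prefix("000001111", figure_1_1))   # ['a', 'b', ' ']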

A minimal prefix code is a prefix code such that if x is a proper prefix of some codeword, then x sigma is either a codeword or a proper prefix of a codeword, for each letter sigma in beta. The set of codewords { 00, 01, 10 } is an example of a prefix code which is not minimal. The fact that 1 is a proper prefix of the codeword 10 requires that 11 be either a codeword or a proper prefix of a codeword, and it is neither. Intuitively, the minimality constraint prevents the use of codewords which are longer than necessary. In the above example the codeword 10 could be replaced by the codeword 1, yielding a minimal prefix code with shorter codewords. The codes discussed in this paper are all minimal prefix codes.
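
The minimality condition can be checked mechanically, as in the following sketch (the function name is illustrative): for every nonempty proper prefix x of a codeword and each code letter sigma in { 0, 1 }, test whether x sigma is a codeword or a proper prefix of a codeword. The code { 00, 01, 10 } fails because 11 is neither, while the code { 00, 01, 1 } obtained by shortening 10 passes.

    def is_minimal_prefix_code(codewords):
        """Check minimality: every extension x + sigma of a nonempty proper
        prefix x of a codeword must be a codeword or a proper prefix of one."""
        def usable(x):
            return x in codewords or any(cw != x and cw.startswith(x) for cw in codewords)
        prefixes = {cw[:i] for cw in codewords for i in range(1, len(cw))}
        return all(usable(x + sigma) for x in prefixes for sigma in "01")

    print(is_minimal_prefix_code({'00', '01', '10'}))   # False: 11 is unusable
    print(is_minimal_prefix_code({'00', '01', '1'}))    # True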

In this section, a code has been defined to be a mapping from a source alphabet to a code alphabet; we now define related terms. The process of transforming a source ensemble into a coded message is coding or encoding. The encoded message may be referred to as an encoding of the source ensemble. The algorithm which constructs the mapping and uses it to transform the source ensemble is called the encoder. The decoder performs the inverse operation, restoring the coded message to its original form.

1.2 Classification of Methods

In addition to the categorization of data compression schemes with respect to message and codeword lengths, these methods are classified as either static or dynamic. A static method is one in which the mapping from the set of messages to the set of codewords is fixed before transmission begins, so that a given message is represented by the same codeword every time it appears in the message ensemble. The classic static defined-word scheme is Huffman coding [Huffman 1952]. In Huffman coding, the assignment of codewords to source messages is based on the probabilities with which the source messages appear in the message ensemble. Messages which appear more frequently are represented by short codewords; messages with smaller probabilities map to longer codewords. These probabilities are determined before transmission begins. A Huffman code for the ensemble EXAMPLE is given in Figure 1.3. If EXAMPLE were coded using this Huffman mapping, the length of the coded message would be 117 bits. Static Huffman coding is discussed in Section 3.2. Other static schemes are discussed in Sections 2 and 3.
   source message   probability      codeword

        a             2/40           1001
        b             3/40           1000
        c             4/40           011
        d             5/40           010
        e             6/40           111
        f             7/40           110
        g             8/40           00
      space           5/40           101

Figure 1.3 -- A Huffman code for the message EXAMPLE (code length=117).
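
Section 3.2 describes the construction of such codes in detail; the following minimal sketch, using Python's heapq module, builds a Huffman code from the frequencies of Figure 1.3 by repeatedly merging the two lowest-weight subtrees. Because ties may be broken either way, the individual codewords may differ from those in the figure, but the total coded length of EXAMPLE is 117 bits in either case.

    import heapq
    import itertools

    def huffman_code(freqs):
        """Repeatedly merge the two least-frequent subtrees, prefixing 0 to
        the codewords of one subtree and 1 to those of the other."""
        counter = itertools.count()   # tie-breaker so heap entries compare cleanly
        heap = [(w, next(counter), {m: ""}) for m, w in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            w1, _, code1 = heapq.heappop(heap)
            w2, _, code2 = heapq.heappop(heap)
            merged = {m: "0" + cw for m, cw in code1.items()}
            merged.update({m: "1" + cw for m, cw in code2.items()})
            heapq.heappush(heap, (w1 + w2, next(counter), merged))
        return heap[0][2]

    frequencies = {'a': 2, 'b': 3, 'c': 4, 'd': 5, 'e': 6, 'f': 7, 'g': 8, ' ': 5}
    code = huffman_code(frequencies)
    print(sum(frequencies[m] * len(code[m]) for m in frequencies))   # 117
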
A code is dynamic if the mapping from the set of messages to the set of codewords changes over time. For example, dynamic Huffman coding involves computing an approximation to the probabilities of occurrence "on the fly", as the ensemble is being transmitted. The assignment of codewords to messages is based on the values of the relative frequencies of occurrence at each point in time. A message x may be represented by a short codeword early in the transmission because it occurs frequently at the beginning of the ensemble, even though its probability of occurrence over the total ensemble is low. Later, when the more probable messages begin to occur with higher frequency, the short codeword will be mapped to one of the higher probability messages and x will be mapped to a longer codeword. As an illustration, Figure 1.4 presents a dynamic Huffman code table corresponding to the prefix aa bbb of EXAMPLE. Although the frequency of space over the entire message is greater than that of b, at this point in time b has higher frequency and therefore is mapped to the shorter codeword.
   source message   probability      codeword

        a             2/6            10
        b             3/6            0
      space           1/6            11

Figure 1.4 -- A dynamic Huffman code table for the prefix
              aa bbb of message EXAMPLE.
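
A brute-force way to reproduce Figure 1.4 is to count frequencies over the portion of the ensemble seen so far and build a code from those counts, as in the following sketch (which reuses the huffman_code function sketched after Figure 1.3). Rebuilding the code from scratch at each point is meant only to convey the idea; the adaptive Huffman algorithms of Section 4 update the code incrementally instead.

    from collections import Counter

    def adaptive_snapshot(seen_so_far):
        """Relative frequencies and a Huffman code for the ensemble seen so far."""
        counts = Counter(seen_so_far)
        return counts, huffman_code(counts)   # huffman_code: see the sketch after Figure 1.3

    counts, code = adaptive_snapshot("aa bbb")
    print(counts)   # a: 2, b: 3, space: 1, as in Figure 1.4
    print(code)     # b receives the one-bit codeword; a and space receive two bits
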
Dynamic codes are also referred to in the literature as adaptive, in that they adapt to changes in ensemble characteristics over time. The term adaptive will be used for the remainder of this paper; the fact that these codes adapt to changing characteristics is the source of their appeal. Some adaptive methods adapt to changing patterns in the source [Welch 1984] while others exploit locality of reference [Bentley et al. 1986]. Locality of reference is the tendency, common in a wide variety of text types, for a particular word to occur frequently for short periods of time and then fall into disuse for long periods.

All of the adaptive methods are one-pass methods; only one scan of the ensemble is required. Static Huffman coding requires two passes: one pass to compute probabilities and determine the mapping, and a second pass for transmission. Thus, as long as the encoding and decoding times of an adaptive method are not substantially greater than those of a static method, the fact that an initial scan is not needed implies a speed improvement in the adaptive case. In addition, the mapping determined in the first pass of a static coding scheme must be transmitted by the encoder to the decoder. The mapping may preface each transmission (that is, each file sent), or a single mapping may be agreed upon and used for multiple transmissions. In one-pass methods the encoder defines and redefines the mapping dynamically, during transmission. The decoder must define and redefine the mapping in sympathy, in essence "learning" the mapping as codewords are received. Adaptive methods are discussed in Sections 4 and 5.

An algorithm may also be a hybrid, neither completely static nor completely dynamic. In a simple hybrid scheme, sender and receiver maintain identical codebooks containing k static codes. For each transmission, the sender must choose one of the k previously-agreed-upon codes and inform the receiver of his choice (by transmitting first the "name" or number of the chosen code). Hybrid methods are discussed further in Section 2 and Section 3.2.

1.3 A Data Compression Model

In order to discuss the relative merits of data compression techniques, a framework for comparison must be established. There are two dimensions along which each of the schemes discussed here may be measured: algorithm complexity and amount of compression. When data compression is used in a data transmission application, the goal is speed. Speed of transmission depends upon the number of bits sent, the time required for the encoder to generate the coded message, and the time required for the decoder to recover the original ensemble. In a data storage application, although the degree of compression is the primary concern, it is nonetheless necessary that the algorithm be efficient in order for the scheme to be practical. For a static scheme, there are three algorithms to analyze: the map construction algorithm, the encoding algorithm, and the decoding algorithm. For a dynamic scheme, there are just two algorithms: the encoding algorithm and the decoding algorithm.

Several common measures of compression have been suggested: redundancy [Shannon and Weaver 1949], average message length [Huffman 1952], and compression ratio [Rubin 1976; Ruth and Kreutzer 1972]. These measures are defined below. Related to each of these measures are assumptions about the characteristics of the source. It is generally assumed in information theory that all statistical parameters of a message source are known with perfect accuracy [Gilbert 1971]. The most common model is that of a discrete memoryless source: a source whose output is a sequence of letters (or messages), each letter being a selection from some fixed alphabet a(1), ..., a(n). The letters are taken to be random, statistically independent selections from the alphabet, the selection being made according to some fixed probability assignment p(a(1)), ..., p(a(n)) [Gallager 1968]. Without loss of generality, the code alphabet is assumed to be { 0, 1 } throughout this paper. The modifications necessary for larger code alphabets are straightforward.

It is assumed that any cost associated with the code letters is uniform. This is a reasonable assumption, although it omits applications like telegraphy where the code symbols are of different durations. The assumption is also important, since the problem of constructing optimal codes over unequal code letter costs is a significantly different and more difficult problem. Perl et al. and Varn have developed algorithms for minimum-redundancy prefix coding in the case of arbitrary symbol cost and equal codeword probability [Perl et al. 1975; Varn 1971]. The assumption of equal probabilities mitigates the difficulty presented by the variable symbol cost. For the more general unequal letter costs and unequal probabilities model, Karp has proposed an integer linear programming approach [Karp 1961]. There have been several approximation algorithms proposed for this more difficult problem [Krause 1962; Cot 1977; Mehlhorn 1980].

When data is compressed, the goal is to reduce redundancy, leaving only the informational content. The measure of information of a source message x (in bits) is -lg p(x) [lg denotes the base 2 logarithm]. This definition has intuitive appeal; in the case that p(x) = 1, it is clear that x is not at all informative since it had to occur. Similarly, the smaller the value of p(x), the more unlikely x is to appear, hence the larger its information content. The reader is referred to Abramson for a longer, more elegant discussion of the legitimacy of this technical definition of the concept of information [Abramson 1963, pp. 6-13]. The average information content over the source alphabet can be computed by weighting the information content of each source letter by its probability of occurrence, yielding the expression SUM{i=1 to n} [-p(a(i)) lg p(a(i))]. This quantity is referred to as the entropy of a source letter, or the entropy of the source, and is denoted by H. Since the length of a codeword for message a(i) must be sufficient to carry the information content of a(i), entropy imposes a lower bound on the number of bits required for the coded message. The total number of bits must be at least as large as the product of H and the length of the source ensemble. Since the value of H is generally not an integer, variable-length codewords must be used if the lower bound is to be achieved. Given that message EXAMPLE is to be encoded one letter at a time, the entropy of its source can be calculated using the probabilities given in Figure 1.3: H = 2.894, so that the minimum number of bits contained in an encoding of EXAMPLE is 116. The Huffman code given in Section 1.2 does not quite achieve the theoretical minimum in this case.
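
The entropy figure quoted above is easily reproduced; the short sketch below computes H from the probabilities of Figure 1.3, along with the corresponding lower bound on the length of any encoding of the 40-letter ensemble.

    from math import ceil, log2

    # Probabilities from Figure 1.3.
    p = {'a': 2/40, 'b': 3/40, 'c': 4/40, 'd': 5/40,
         'e': 6/40, 'f': 7/40, 'g': 8/40, ' ': 5/40}

    H = sum(-pi * log2(pi) for pi in p.values())
    print(round(H, 3))    # 2.894 bits per source letter
    print(ceil(40 * H))   # 116: lower bound on the length of an encoding of EXAMPLE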

Both of these definitions of information content are due to Shannon. A derivation of the concept of entropy as it relates to information theory is presented by Shannon [Shannon and Weaver 1949]. A simpler, more intuitive explanation of entropy is offered by Ash [Ash 1965].

The most common notion of a "good" code is one which is optimal in the sense of having minimum redundancy. Redundancy can be defined as: SUM p(a(i)) l(i) - SUM [-p(a(i)) lg p(a(i))] where l(i) is the length of the codeword representing message a(i). The expression SUM p(a(i)) l(i) represents the lengths of the codewords weighted by their probabilities of occurrence, that is, the average codeword length. The expression SUM [-p(a(i)) lg p(a(i))] is entropy, H. Thus, redundancy is a measure of the difference between average codeword length and average information content. If a code has minimum average codeword length for a given discrete probability distribution, it is said to be a minimum redundancy code.
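
For the Huffman code of Figure 1.3, the average codeword length is 117/40 = 2.925 bits and the entropy is 2.894 bits, so the redundancy is about 0.031 bits per source message. The sketch below evaluates both sums, with the codeword lengths read off Figure 1.3.

    from math import log2

    # Probabilities and codeword lengths from Figure 1.3.
    p = {'a': 2/40, 'b': 3/40, 'c': 4/40, 'd': 5/40,
         'e': 6/40, 'f': 7/40, 'g': 8/40, ' ': 5/40}
    l = {'a': 4, 'b': 4, 'c': 3, 'd': 3, 'e': 3, 'f': 3, 'g': 2, ' ': 3}

    average_codeword_length = sum(p[m] * l[m] for m in p)   # 2.925 bits
    entropy = sum(-p[m] * log2(p[m]) for m in p)            # 2.894 bits
    print(round(average_codeword_length - entropy, 3))      # redundancy: about 0.031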

We define the term local redundancy to capture the notion of redundancy caused by local properties of a message ensemble, rather than its global characteristics. While the model used for analyzing general-purpose coding techniques assumes a random distribution of the source messages, this may not actually be the case. In particular applications the tendency for messages to cluster in predictable patterns may be known. The existence of predictable patterns may be exploited to minimize local redundancy. Examples of applications in which local redundancy is common, and methods for dealing with local redundancy, are discussed in Section 2 and Section 6.2.

Huffman uses average message length, SUM p(a(i)) l(i), as a measure of the efficiency of a code. Clearly the meaning of this term is the average length of a coded message. We will use the term average codeword length to represent this quantity. Since redundancy is defined to be average codeword length minus entropy and entropy is constant for a given probability distribution, minimizing average codeword length minimizes redundancy.

A code is asymptotically optimal if it has the property that for a given probability distribution, the ratio of average codeword length to entropy approaches 1 as entropy tends to infinity. That is, asymptotic optimality guarantees that average codeword length approaches the theoretical minimum (entropy represents information content, which imposes a lower bound on codeword length).

The amount of compression yielded by a coding scheme can be measured by a compression ratio. The term compression ratio has been defined in several ways. The definition C = (average message length)/(average codeword length) captures the common meaning, which is a comparison of the length of the coded message to the length of the original ensemble [Cappellini 1985]. If we think of the characters of the ensemble EXAMPLE as 6-bit ASCII characters, then the average message length is 6 bits. The Huffman code of Section 1.2 represents EXAMPLE in 117 bits, or 2.9 bits per character. This yields a compression ratio of 6/2.9, representing compression by a factor of more than 2. Alternatively, we may say that Huffman encoding produces a file whose size is 49% of the original ASCII file, or that 49% compression has been achieved. A somewhat different definition of compression ratio, by Rubin, C = (S - O - OR)/S, includes the representation of the code itself in the transmission cost [Rubin 1976]. In this definition S represents the length of the source ensemble, O the length of the output (coded message), and OR the size of the "output representation" (e.g., the number of bits required for the encoder to transmit the code mapping to the decoder). The quantity OR constitutes a "charge" to an algorithm for transmission of information about the coding scheme. The intention is to measure the total size of the transmission (or file to be stored).
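
Both definitions can be worked out from the numbers already given for EXAMPLE, as in the sketch below; the function name rubin_ratio is illustrative, and OR is taken to be zero for the case in which the code mapping has been agreed upon in advance and need not be transmitted.

    S = 40 * 6   # the source ensemble as 6-bit characters: 240 bits
    O = 117      # length of the Huffman-coded EXAMPLE (Figure 1.3)

    # Cappellini-style ratio: average message length / average codeword length.
    print(6 / (O / 40))   # about 2.05: compression by a factor of more than 2
    print(O / S)          # about 0.49: the coded file is 49% of the original

    # Rubin's definition charges the transmission for the code representation OR.
    def rubin_ratio(S, O, OR):
        """C = (S - O - OR) / S."""
        return (S - O - OR) / S

    print(rubin_ratio(S, O, OR=0))   # 0.5125 when no code mapping need be sent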

1.4 Motivation

As discussed in the Introduction, data compression has wide application in terms of information storage, including representation of the abstract data type string [Standish 1980] and file compression. Huffman coding is used for compression in several file archival systems [ARC 1986; PKARC 1987], as is Lempel-Ziv coding, one of the adaptive schemes to be discussed in Section 5. An adaptive Huffman coding technique is the basis for the compact command of the UNIX operating system, and the UNIX compress utility employs the Lempel-Ziv approach [UNIX 1984].

In the area of data transmission, Huffman coding has been passed over for years in favor of block-block codes, notably ASCII. The advantage of Huffman coding is in the average number of bits per character transmitted, which may be much smaller than the lg n bits per character (where n is the source alphabet size) of a block-block system. The primary difficulty associated with variable-length codewords is that the rate at which bits are presented to the transmission channel will fluctuate, depending on the relative frequencies of the source messages. This requires buffering between the source and the channel. Advances in technology have both overcome this difficulty and contributed to the appeal of variable-length codes. Current data networks allocate communication resources to sources on the basis of need and provide buffering as part of the system. These systems require significant amounts of protocol, and fixed-length codes are quite inefficient for applications such as packet headers. In addition, communication costs are beginning to dominate storage and processing costs, so that variable-length coding schemes which reduce communication costs are attractive even if they are more complex. For these reasons, one could expect to see even greater use of variable-length coding in the future.

It is interesting to note that the Huffman coding algorithm, originally developed for the efficient transmission of data, also has a wide variety of applications outside the sphere of data compression. These include construction of optimal search trees [Zimmerman 1959; Hu and Tucker 1971; Itai 1976], list merging [Brent and Kung 1978], and generating optimal evaluation trees in the compilation of expressions [Parker 1980]. Additional applications involve search for jumps in a monotone function of a single variable, sources of pollution along a river, and leaks in a pipeline [Glassey and Karp 1976]. The fact that this elegant combinatorial algorithm has influenced so many diverse areas underscores its importance.
