Name

    LZWH -- LZW Compression with Horspool Extensions

Name Strings

    lzwh.h, lzwh.c

Contact

    Jon Mayo (jon.mayo 'at' gmail.com)

IP Status

    Made by a machine. PUBLIC DOMAIN. (CC0-1.0)

Status

    Complete. Tested against the Calgary Corpus.

Version

    Last Modified Date: 2026-04-30
    Revision: 1

Number

    N/A

Dependencies

    C99 or later. No external library dependencies.

Abstract

LZWH is a small, self-contained, embeddable compression library that combines LZW (Lempel-Ziv-Welch) dictionary compression with two enhancements from Horspool (1991): phased-in binary coding for output symbols and accelerated dictionary loading. The implementation consists of a single C source file and header with no dependencies beyond the C standard library.

The library targets use cases that need a simple, portable compressor that can be dropped into a project as two files. It is not intended to compete with general-purpose compressors such as gzip/DEFLATE, which achieve substantially better ratios through sliding-window matching and entropy coding.

Introduction

Standard LZW emits output codes at fixed bit widths that double in range at each power-of-two dictionary size. This wastes fractional bits on every code and forces the decoder to track width transitions. Horspool's phased-in binary coding replaces the fixed-width scheme with a variable-length encoding that uses either floor(log2(n)) or ceil(log2(n)) bits per code, depending on the code's value, where n is the current dictionary size. For a range [0, n), let b = floor(log2(n)) and threshold = 2^(b+1) - n. Codes below the threshold use b bits; codes at or above it use b+1 bits. This eliminates the fractional bits wasted by power-of-two width stepping.

Horspool's accelerated dictionary loading extends the standard LZW rule "add one entry per emitted code" by preloading additional dictionary entries derived from the matched string.
After the standard entry is added on a mismatch, the compressor also inserts entries extending the previous code by each subsequent character of the new match, up to accel_max-1 additional entries. This fills the dictionary faster, allowing longer matches to be found sooner. The decompressor mirrors this by examining the decoded string and adding the same extra entries. The acceleration is controlled by the accel_max parameter:

    accel_max = 1    Standard LZW (one entry per code emitted)
    accel_max = 3    Two extra entries per emit (recommended for text)
    accel_max = 5    Four extra entries per emit

Higher acceleration benefits repetitive text inputs, where the dictionary converges to useful entries faster. On high-entropy or binary data, the larger dictionary increases code widths without enough repetition to compensate, and compression may be worse than standard LZW.

Bit packing is LSB-first (least significant bit first). The compressed stream begins with a 4-byte little-endian header holding the original length, followed by phased-in coded LZW symbols.

References

    R.N. Horspool, "Improving LZW," Proceedings of the Data Compression
    Conference (DCC '91), IEEE, 1991, pp. 332-341.

    T.A. Welch, "A Technique for High-Performance Data Compression,"
    IEEE Computer, vol. 17, no. 6, pp. 8-19, June 1984.

API

The complete API consists of three functions and one structure, declared in lzwh.h.

Types
-----

struct lzwh_params

    Configuration structure for compression and decompression.

    Members:

    unsigned dict_bits
        Maximum dictionary size as log2(entries). Valid range is 9 to
        16 inclusive. Default is 16, giving a maximum of 65536
        dictionary entries. Smaller values reduce memory usage at the
        cost of compression ratio.

    unsigned accel_max
        Accelerated loading depth. A value of 1 gives standard LZW
        behavior. Higher values preload more dictionary entries per
        emitted code, filling the dictionary faster. Values of 3 to 5
        are recommended for text. Default is 1.
Functions
---------

void lzwh_params_init(struct lzwh_params *p)

    Initialize a params structure to default values (dict_bits=16,
    accel_max=1). Must be called before modifying individual fields.

int lzwh_compress(const uint8_t *src, size_t src_len, uint8_t **out,
                  size_t *out_len, const struct lzwh_params *params)

    Compress the src buffer of src_len bytes. On success, allocates an
    output buffer via malloc(3) and returns it through *out and
    *out_len. The caller is responsible for freeing the output buffer.
    If params is NULL, default parameters are used.

    Returns LZWH_OK on success, or a negative error code:

        LZWH_OK (0)            Success
        LZWH_ERR (-1)          General error
        LZWH_ERR_NOMEM (-2)    Memory allocation failed
        LZWH_ERR_CORRUPT (-3)  Should not occur during compression

int lzwh_decompress(const uint8_t *src, size_t src_len, uint8_t **out,
                    size_t *out_len, const struct lzwh_params *params)

    Decompress the src buffer of src_len bytes that was previously
    produced by lzwh_compress. On success, allocates an output buffer
    via malloc(3) and returns it through *out and *out_len. The caller
    is responsible for freeing the output buffer. The params must match
    those used during compression. If params is NULL, default
    parameters are used.

    Returns LZWH_OK on success, or a negative error code:

        LZWH_OK (0)            Success
        LZWH_ERR (-1)          General error
        LZWH_ERR_NOMEM (-2)    Memory allocation failed
        LZWH_ERR_CORRUPT (-3)  Compressed data is malformed

Constants
---------

    LZWH_DICT_BITS_DEFAULT  16
    LZWH_DICT_BITS_MIN       9
    LZWH_DICT_BITS_MAX      16
    LZWH_ACCEL_DEFAULT       1

    LZWH_OK                  0
    LZWH_ERR                -1
    LZWH_ERR_NOMEM          -2
    LZWH_ERR_CORRUPT        -3

Usage

To integrate LZWH into a project, copy lzwh.h and lzwh.c into the source tree and compile lzwh.c as part of the build. No special compiler flags are required beyond C99 mode. To build with GCC or Clang:

    cc -std=c99 -O2 -c lzwh.c

The params structure must always be initialized before use via lzwh_params_init; individual fields may then be overridden. The same params must be passed to both compress and decompress.
Passing NULL for params uses the defaults (dict_bits=16, accel_max=1).

Example

    #include "lzwh.h"

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        const char *text = "hello world hello world hello world";
        size_t tlen = strlen(text);

        /* configure for accelerated loading */
        struct lzwh_params p;
        lzwh_params_init(&p);
        p.accel_max = 3;

        /* compress */
        uint8_t *comp = NULL;
        size_t comp_len = 0;
        int rc = lzwh_compress((const uint8_t *)text, tlen,
                               &comp, &comp_len, &p);
        if (rc != LZWH_OK) {
            fprintf(stderr, "compress failed: %d\n", rc);
            return 1;
        }
        printf("compressed %zu -> %zu bytes\n", tlen, comp_len);

        /* decompress */
        uint8_t *decomp = NULL;
        size_t decomp_len = 0;
        rc = lzwh_decompress(comp, comp_len, &decomp, &decomp_len, &p);
        if (rc != LZWH_OK) {
            fprintf(stderr, "decompress failed: %d\n", rc);
            free(comp);
            return 1;
        }

        if (decomp_len == tlen && memcmp(decomp, text, tlen) == 0)
            printf("round-trip verified\n");
        else
            printf("MISMATCH\n");

        free(comp);
        free(decomp);
        return 0;
    }

Benchmark Data

All measurements were taken on the Calgary Corpus (18 files, 3,251,493 bytes total). Timing was measured on an x86-64 Linux system with GCC -O2. Times are wall-clock per file, single-threaded.
Standard LZW (accel_max=1)
--------------------------

    File       Original   Compressed    Ratio     Comp   Decomp
    ----------------------------------------------------------------
    bib         111,261       44,913    40.4%   3.2 ms   1.4 ms
    book1       768,771    1,091,606   142.0%  18.1 ms  15.5 ms
    book2       610,856      733,780   120.1%  14.5 ms  11.1 ms
    geo         102,400       73,919    72.2%   2.8 ms   1.6 ms
    news        377,109      362,197    96.0%  10.2 ms   6.3 ms
    obj1         21,504       13,401    62.3%   0.8 ms   0.4 ms
    obj2        246,814      139,868    56.7%   7.1 ms   3.7 ms
    paper1       53,161       24,232    45.6%   3.6 ms   0.8 ms
    paper2       82,199       34,881    42.4%   2.2 ms   1.1 ms
    paper3       46,526       21,326    45.8%   1.3 ms   0.6 ms
    paper4       13,286        6,681    50.3%   0.4 ms   0.2 ms
    paper5       11,954        6,325    52.9%   0.4 ms   0.2 ms
    paper6       38,105       18,047    47.4%   1.1 ms   0.6 ms
    pic         513,216       60,070    11.7%   8.6 ms   2.8 ms
    progc        39,611       18,481    46.7%   1.1 ms   0.6 ms
    progl        71,646       26,406    36.9%   1.8 ms   0.9 ms
    progp        49,379       18,611    37.7%   1.3 ms   0.6 ms
    trans        93,695       37,133    39.6%   2.6 ms   1.2 ms
    ----------------------------------------------------------------
    TOTAL     3,251,493    2,731,877    84.0%

Horspool LZW (accel_max=3)
--------------------------

    File       Original   Compressed    Ratio     Comp   Decomp
    ----------------------------------------------------------------
    bib         111,261       41,093    36.9%   3.4 ms   1.4 ms
    book1       768,771    1,384,494   180.1%  15.6 ms  17.9 ms
    book2       610,856    1,051,662   172.2%  13.1 ms  13.5 ms
    geo         102,400      118,157   115.4%   3.2 ms   2.0 ms
    news        377,109      604,824   160.4%   9.7 ms   8.1 ms
    obj1         21,504       12,974    60.3%   0.7 ms   0.4 ms
    obj2        246,814      358,932   145.4%   6.2 ms   5.3 ms
    paper1       53,161       22,404    42.1%   3.3 ms   0.8 ms
    paper2       82,199       32,659    39.7%   2.5 ms   1.1 ms
    paper3       46,526       20,198    43.4%   1.4 ms   0.7 ms
    paper4       13,286        6,359    47.9%   0.5 ms   0.3 ms
    paper5       11,954        5,930    49.6%   2.1 ms   0.2 ms
    paper6       38,105       16,610    43.6%   1.2 ms   0.6 ms
    pic         513,216      398,565    77.7%   9.7 ms   6.3 ms
    progc        39,611       16,843    42.5%   1.2 ms   0.6 ms
    progl        71,646       23,793    33.2%   1.9 ms   0.9 ms
    progp        49,379       16,335    33.1%   1.3 ms   0.6 ms
    trans        93,695       31,671    33.8%   2.7 ms   1.1 ms
    ----------------------------------------------------------------
    TOTAL     3,251,493    4,163,503   128.0%
Horspool LZW (accel_max=5)
--------------------------

    File       Original   Compressed    Ratio     Comp   Decomp
    ----------------------------------------------------------------
    bib         111,261       78,722    70.8%   3.2 ms   1.9 ms
    book1       768,771    1,423,910   185.2%  17.5 ms  18.0 ms
    book2       610,856    1,096,383   179.5%  13.3 ms  14.0 ms
    geo         102,400      121,890   119.0%   3.2 ms   2.0 ms
    news        377,109      636,149   168.7%   9.4 ms   8.6 ms
    obj1         21,504       12,949    60.2%   0.7 ms   0.4 ms
    obj2        246,814      383,256   155.3%   6.3 ms   5.3 ms
    paper1       53,161       21,901    41.2%   3.7 ms   0.7 ms
    paper2       82,199       36,296    44.2%   2.6 ms   1.2 ms
    paper3       46,526       19,950    42.9%   1.5 ms   0.7 ms
    paper4       13,286        6,313    47.5%   0.5 ms   0.2 ms
    paper5       11,954        5,878    49.2%   2.1 ms   0.2 ms
    paper6       38,105       16,297    42.8%   1.2 ms   0.6 ms
    pic         513,216      558,937   108.9%   9.9 ms   8.0 ms
    progc        39,611       16,360    41.3%   1.2 ms   0.6 ms
    progl        71,646       22,671    31.6%   2.0 ms   0.8 ms
    progp        49,379       15,571    31.5%   1.4 ms   0.6 ms
    trans        93,695       29,576    31.6%   2.8 ms   1.1 ms
    ----------------------------------------------------------------
    TOTAL     3,251,493    4,503,009   138.5%

Comparison with gzip (DEFLATE)
------------------------------

The following table compares aggregate Calgary Corpus compression ratios across methods. gzip uses LZ77 sliding-window matching with Huffman entropy coding, which is fundamentally more capable than dictionary-only LZW.
    Method               Compressed     Ratio
    -----------------------------------------------
    gzip -1 (fast)        1,193,013     36.7%
    gzip -6 (default)     1,062,948     32.7%
    gzip -9 (best)        1,058,056     32.5%
    LZWH accel=1          2,731,877     84.0%
    LZWH accel=3          4,163,503    128.0%
    LZWH accel=5          4,503,009    138.5%

On individual file classes, the acceleration improves the ratio for structured, repetitive text (source code, bibliographies), though bib regresses at accel=5:

    File      accel=1    accel=3    accel=5    gzip -6
    ------------------------------------------------
    progp       37.7%      33.1%      31.5%      22.7%
    progl       36.9%      33.2%      31.6%      22.7%
    trans       39.6%      33.8%      31.6%      20.2%
    bib         40.4%      36.9%      70.8%      31.5%

But it hurts on high-entropy or binary data:

    File      accel=1    accel=3    accel=5    gzip -6
    ------------------------------------------------
    geo         72.2%     115.4%     119.0%      66.8%
    obj2        56.7%     145.4%     155.3%      33.0%
    book1      142.0%     180.1%     185.2%      40.7%
    news        96.0%     160.4%     168.7%      38.4%

LZWH with accel_max=1 is competitive with standard LZW implementations on structured data and highly repetitive inputs (e.g., pic at 11.7%). The phased-in coding saves approximately 0.5 bits per code on average compared with the power-of-two width stepping of classic LZW.

The acceleration feature (accel_max > 1) is a specialized optimization. It is most effective on inputs with recurring substrings, where filling the dictionary faster produces earlier long matches. On the general-purpose Calgary Corpus it expands most files, because the faster dictionary growth increases the average code width without sufficient repetition to compensate.

Issues

1. Compatibility

   RESOLVED: LZWH is not compatible with Unix compress(1), GIF LZW, or
   TIFF LZW. The phased-in coding, accelerated loading, and LSB-first
   bit packing all differ from standard LZW implementations. Data
   compressed with LZWH can only be decompressed with the same library
   and matching parameters.

2. Dictionary full behavior

   RESOLVED: When the dictionary reaches its maximum size (2^dict_bits
   entries), no further entries are added. The compressor continues to
   match against existing entries.
   No clear code is used. This is the simplest strategy and avoids the
   complexity of dictionary reset heuristics.

3. Accelerated entry range safety

   RESOLVED: Accelerated loading may create dictionary entries whose
   codes lie beyond the current phased-in coding range, because the
   decompressor has not yet seen the code that triggers those entries.
   If the compressor matched such an entry, emitting its code would
   overflow the phased-in coding. This is handled by rejecting matches
   to codes >= emit_range and falling through to the emit path. The
   entry remains in the dictionary and becomes usable once emit_range
   advances past it.

4. Maximum input size

   RESOLVED: The original length is stored as a 4-byte little-endian
   value, limiting input to 4 GiB. This is sufficient for the intended
   use cases (directory listings, ELF executables).

5. Thread safety

   RESOLVED: The library uses no global or static mutable state.
   Multiple threads may compress or decompress concurrently with
   independent buffers and params structures.

Revision History

    Rev.  Date        Changes
    ----  ----------  -------------------------------------------
    1     2026-04-30  Initial version. Standard LZW with Horspool's
                      phased-in coding and accelerated loading.
                      Verified against the Calgary Corpus at accel
                      levels 1, 3, and 5.