A Compression Survey

I did some brief research online today because I really needed a refresher on the state of off-the-shelf compression routines. In what follows I try to provide a relatively short, high-level survey of what's out there. Though I do touch on which major algorithms exist, I try to stay away from the details of how they work.

Archive Formats

First, it's important to distinguish between archiving and compressing.

Archiving takes multiple files and combines them into a single file; it helps you to package your data. Compressing takes an individual file and makes it smaller.

There are archive formats that don't provide compression, and there are compression formats that don't provide archiving. Most formats that provide compression use one of several different types of compression algorithms. Some algorithms are themselves combinations of multiple algorithms used together to achieve better compression.

Although this sounds like the discussion could get pretty complicated, there are actually very few unique compression algorithms running around. Probably the two most commonly used archive formats are ZIP and TAR.

TAR provides a linear, uncompressed archive format. Archives in the TAR format are occasionally referred to as "tarballs". Each file entry in a tarball begins with a header that stores the length of the file, so the archive looks like: header, file; header, file; header, file; etc.
This suits its tape-based origins and may come in handy for streaming applications, but it can be a problem if you need the ability to search for a specific file in the archive.
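
If you want to poke at that layout yourself, Python's standard tarfile module (mentioned again below) makes it easy; a minimal sketch, with hypothetical file names standing in for real content:

    import tarfile

    # write a plain, uncompressed TAR: header, file; header, file; ...
    with tarfile.open("assets.tar", "w") as tar:
        tar.add("textures/grass.png")
        tar.add("sounds/footstep.wav")

    # pulling one member back out means scanning those headers in order
    with tarfile.open("assets.tar", "r") as tar:
        member = tar.getmember("sounds/footstep.wav")
        data = tar.extractfile(member).read()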

ZIP provides an indexed archive format -- that is, files can be found within a ZIP'd package without having to search through the whole package. ZIP also provides a default compression mechanism called "DEFLATE". ZIP'ing a large number of small files together can produce a larger result than first TAR'ing the files and then compressing them, because ZIP can't take advantage of similarities between files in the same archive, whereas TAR makes many files physically one file.
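
Again in Python, a sketch (with hypothetical file names) using the standard zipfile module shows off both points: the central index makes random access cheap, but each member is compressed on its own:

    import zipfile

    # DEFLATE each file individually and record it in the central index
    with zipfile.ZipFile("assets.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write("textures/grass.png")
        zf.write("sounds/footstep.wav")

    with zipfile.ZipFile("assets.zip", "r") as zf:
        print(zf.namelist())                   # read straight from the index
        data = zf.read("sounds/footstep.wav")  # jump directly to one member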

The ZIP format is openly documented. The Info-Zip community provides free libraries for working with both ZIP and DEFLATE.

Java and Python both have standard libraries incorporating the Info-Zip code.

The UnZip page has links to documentation on the format -- the primary documentation is stored as a text file (appnote.iz) available via FTP.

PKWARE maintains its own documentation on the ZIP format, also as a text file. PKWARE has introduced additions to the format over the years (new compression methods, for example), but the general structure remains the same (so as to interoperate with existing software). You can also purchase PKWARE's SDK for $3500 -- I'm not sure what it provides above and beyond something like Info-Zip and zlib. For the ASCII-averse, you can find a graphical description of the ZIP format as used in the Java SDK here.

Interestingly, id Software uses ZIP for its archives. The extensions they use are .pk3 and .pk4, but the files are really just ZIP archives. Many commercial, proprietary-looking formats are likewise really ZIP under the hood.

As an aside: I had always thought "ZIP" referred to closing files up together (like a "zipper"), but it actually refers to the speed of its default compression mechanism (which the author felt was speedy, or "zippy").

RAR, like ZIP, is both an archive format and a (set of) compression formats. It typically compresses smaller than the ZIP methods, the trade-off being that it takes longer to create the archive.

RAR is touted for its ability to create archives that span multiple volumes (e.g. disks); it can also create a "solid" archive -- one in which all of the separate files are compressed as a single large stream, so the compressor can exploit similarities between them. The source for the RAR decoder is available, but the encoder is shareware.

Compression Formats

GZIP is not ZIP -- it does, just like ZIP, use the DEFLATE algorithm, but GZIP does not provide an archiving capability. I believe GZIP was created for *NIX to provide ZIP-like (i.e. DEFLATE) compression for use with TAR'd files.
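
The usual *NIX pairing is easy to reproduce; a sketch using Python's tarfile and gzip modules, with hypothetical file names:

    import gzip
    import shutil
    import tarfile

    # archive with TAR, compress with GZIP, in one step
    with tarfile.open("assets.tar.gz", "w:gz") as tar:
        tar.add("textures")                    # an entire directory tree

    # gzip by itself handles exactly one stream -- no index, no members
    with open("level1.map", "rb") as src, gzip.open("level1.map.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)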

bzip2, also created for *NIX I believe, compresses many files better than DEFLATE, although it correspondingly takes longer than DEFLATE to run. Apparently, it compresses text extremely well.
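
The trade-off is easy to measure on your own data; a quick sketch using Python's zlib and bz2 modules, with a hypothetical input file:

    import bz2
    import zlib

    data = open("dialogue.txt", "rb").read()

    deflated = zlib.compress(data, 9)   # DEFLATE at maximum effort
    bzipped = bz2.compress(data, 9)     # bzip2 at maximum effort

    # bzip2 usually wins on text, but takes noticeably longer
    print(len(data), len(deflated), len(bzipped))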

ZLib
Zlib can use a fixed-size allocation to decompress an arbitrarily long stream of compressed data. It uses DEFLATE (though it can be extended to handle more compression types) and has a very liberal license. Because of these two key features it should probably be the first stop for anyone trying to implement compression in their game. Although zlib does not itself support archiving, its distribution does come with a minimal zip/unzip implementation. The standard distribution supports many platforms via standard C, as well as the .NET platform via C#. The Java and Python SDKs both come with their own implementations of DEFLATE/GZIP based on zlib.
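
That fixed-memory decompression is worth a sketch. Here it is via Python's zlib bindings (the underlying C API follows the same feed-a-chunk, cap-the-output pattern); read_chunk() and handle() are hypothetical stand-ins for your own I/O:

    import zlib

    CHUNK = 16 * 1024

    def decompress_stream(read_chunk, handle):
        d = zlib.decompressobj()
        data = read_chunk(CHUNK)
        while data:
            # never produce more than CHUNK bytes of output per call
            buf = d.decompress(data, CHUNK)
            handle(buf)
            # compressed input that didn't fit in that output budget is kept here
            data = d.unconsumed_tail or read_chunk(CHUNK)
        handle(d.flush())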

Compression Algorithms

There are two classifications of compression algorithms: lossy and lossless. Lossless algorithms give you back exactly what you put in; lossy algorithms generally provide domain specific compression (see below) and give you back a "close enough" representation of your data.

During my brief research foray I was only really interested in lossless compression.

What follows are mostly edited quotes from online sources -- not purely text of my own.
It's possible that some of my contracted prose has introduced errors -- if you are interested in a particular format I would suggest Google and Wikipedia. In most cases Wikipedia provided me with a great jumping-off point.

DEFLATE combines two separate algorithms: LZ77 (or is it LZSS?) and Huffman coding. DEFLATE is widely used in many tools and formats.

Huffman coding counts how frequently each data snippet occurs, builds a binary tree from those frequencies, and assigns shorter bit strings to the more frequent snippets; the tree itself can be stored in a very small stream of bits. It apparently works wonders on dictionary-like data.
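
A toy version in Python, just to show the idea (this is not the actual DEFLATE implementation):

    import heapq
    from collections import Counter

    def huffman_codes(text):
        # one heap entry per symbol: [frequency, [symbol, code-so-far]]
        heap = [[freq, [sym, ""]] for sym, freq in Counter(text).items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            lo = heapq.heappop(heap)            # the two least frequent nodes...
            hi = heapq.heappop(heap)
            for pair in lo[1:]:
                pair[1] = "0" + pair[1]         # ...get merged under a new parent,
            for pair in hi[1:]:
                pair[1] = "1" + pair[1]         # growing their codes by one bit
            heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
        return dict(heapq.heappop(heap)[1:])

    print(huffman_codes("abracadabra"))         # 'a' gets the shortest code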

LZ77: replaces sequences of data with references to previously encountered sequences, in some sense building an in-place dictionary. The reference gets encoded as two numbers: a distance, designating how far back into the previous data the matching sequence starts, and a length, representing the number of characters for which the sequences are identical.
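
A brute-force Python sketch of that idea (real implementations use hash chains and a bounded window rather than scanning every earlier position):

    def lz77_tokens(data, window=4096, min_match=3):
        i, out = 0, []
        while i < len(data):
            best_len, best_dist = 0, 0
            for j in range(max(0, i - window), i):
                length = 0
                while (i + length < len(data)
                       and data[j + length] == data[i + length]):
                    length += 1
                if length > best_len:
                    best_len, best_dist = length, i - j
            if best_len >= min_match:
                out.append(("match", best_dist, best_len))   # (distance, length)
                i += best_len
            else:
                out.append(("literal", data[i]))
                i += 1
        return out

    print(lz77_tokens("abcabcabcabc"))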

LZSS: a popular derivative of LZ77. Where LZ77 always outputs an offset/length pair, even when the match is only a single literal byte, LZSS uses one bit to indicate what comes next: a literal byte or an offset/length pair.

LZ78: builds an explicit dictionary of data sequences as it scans through the data -- sequences in the original data are replaced with references into that dictionary. Patent issues may exist with LZ78, and that is thought to have kept people away from the format.

LZW: extends LZ78 in a way that works great with text. It's used with Adobe Acrobat.

Burrows-Wheeler transform: rotates a window of bytes to perform a reversible sort on the window of data. Although it adds a little more data overall to the mix, it rearranges the original data in a way that lets other algorithms compress it well.
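
A naive Python sketch of the transform and its inverse (real implementations, bzip2 included, use much smarter suffix sorting):

    def bwt(s):
        s = s + "\0"                              # sentinel marks the original row
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)  # keep only the last column

    def inverse_bwt(last):
        table = [""] * len(last)
        for _ in range(len(last)):                # repeatedly prepend and re-sort
            table = sorted(last[i] + table[i] for i in range(len(last)))
        return next(row for row in table if row.endswith("\0"))[:-1]

    text = "banana bandana"
    assert inverse_bwt(bwt(text)) == text
    print(bwt(text))                              # similar characters cluster together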

Domain Specific Compression

In many cases algorithms can be custom tailored to the kind of data you have. With audio files, for instance, it's possible to break a sound stream into frequency bands and then compress each band separately. In the audio realm alone there are a ton of algorithms and formats that operate on that basic principle: OGG, MP3, AAC, WMA, etc. It's probably worth your time to do some searches for your specific content types to see what's out there.

Additional Resources

Discusses compression and archive file formats: http://schmidt.devlib.org/file-formats/archive-file-formats.html

Provides summaries and links to more information on many different data compression algorithms and formats: http://www.datacompression.info/index.shtml

The big question for games is not really how fast things compress but how fast they decompress.
I'll try to hunt down some sites that compare and contrast.

Bink supports its own compression techniques, some of which are also obtainable via Granny.
I'm not sure what information RAD has publicly on their site.
