How to decode LZ4
As background for the following post in this series, I need to describe the LZ4 format and how to decode it. This was going to be part of the next post, but it got way too big. And, anyway, maybe this will be of use to someone on its own, and it'll be easier to find here than in the middle of a wall of text.
How to decode LZ4 data
The LZ4 format is very simple, and the decoder algorithm is (roughly) as follows:
- Read a byte. This is called a token and it contains two fields:
- 4 bits of literal-length
- 4 bits of match-length
- Copy a literal:
- If literal-length is
0xFthen read an extended-literal-length sequence:- Read another byte (an extended-length byte) and add its value to literal-length.
- If that extended-length byte was itself
0xFF, read another extended-length byte and also add its value to literal-length. Keep reading and adding extended-length bytes until you hit one that's not0xFF.
- Copy literal-length bytes from the input stream to the output.
- If literal-length is
- Copy a match:
- Read two bytes of match-distance. These are a little-endian encoded 16-bit unsigned integer which must not be zero.
- If match-length is
0xFF, read an extended-match-length sequence in exactly the manner as the extended-literal-length sequences above. - Add 4 to match-length.
- Copy match-length bytes from match-distance bytes behind the current end of the output stream to the output stream.
- That's it for the current token. Loop back to step 1 to handle the next one.
Note that match-length can be greater than match-distance, in which case you'll be copying data which was just decoded as a part of this token. This gives a sort of run-length encoding.
Knowing when to stop
But wait! If there's always a match and the match is always at least 4 bytes long, then how could one encode a file that's less than 4 bytes, or one which simply doesn't have any matching sequences?
Well, here's the trick: as far as the match-distance is concerned, the last token might contain gibberish (generally truncated) instructions. You have to know how long the output data is and stop processing the final token once you've written that many bytes.
The 2-byte match-distance
The encoder compressed data by finding repetition in the data stream. But in order to keep encoding and decoding efficient, there needs to be some limit on how hard it tries to find and leverage these repeated sequences. In the case of this algorithm, that limit is a 64K sliding window, which is why match-distance is just 16 bits.
This also means that if you wish to suspend and then resume an LZ4 decoder, you have to keep the last-written 64K of data around as state, since it might (in practice: will) want to look back into that window when it's resumed. And that sets the stage for the next post in this series.