User:Ken g6/Draft of LZSS

From Wikipedia, the free encyclopedia

The following is a partial translation from es:LZSS. After translation, I'll go through and edit it.:

The lz77 compression algorithm belongs to the family of lossless compressors, also called text compressors, thusly named because they do not omit information from the archive at compression, as opposed to compressors that use lossy algorithms, which omit some information but considerably diminish the size of the original archive, as is the case with MP3, MPG, jpeg, etc. archives.

Characteristics[edit]

Compressors based on lossless algorithms are used when the information to compress is critical and one cannot lose information, for example in executable archives, database tables, or any type of information that does not allow loss.

The lz77 algorithm is very useful because it is easy to implement and fairly efficient.

In 1977 Abraham Lempel and Jacob Ziv presented their dictionary-based compression algorithm, for text compression. ("Text compression" refers to lossless compression for any kind of data.) Until that time, all compression algorithms were basically static compressors. (They used static dictionaries.) The new algorithm was called lz77 (for obvious reasons). The method always consists of offsets or runs and lengths of text seen previously. It also includes in the output the next byte after a match, because the context (last bytes seen) of this byte is the phrase, and if it was not part of the phrase (the match), then there might not have been compression, thus, why spend time trying to find a match (or space) for it?

In 1982 James Storer and Thomas Szymanski based on the work of Lempel and Ziv, presented their algorithm, LZSS. The principal difference is in the output, lz77 always requires an offset/length pair, though if the match was of just one byte (in which case it would have used more than eight bits to represent a byte) the way that LZSS uses another trick to improve it: it uses flags, which occupy just one bit and tell us what comes next: a literal or an offset/length pair and this algorithm is what we actually use, but the lzss is commonly called lz77, so we will call it lz77 from this point onward, but it is important to remember that it can also be called LZSS. LZSS can also use binary trees or suffix trees to make searches more efficient.

Note: Many searching algorithms, when combined with the lz77 algorithm, are patented in the United States and other countries that respect software patents.[1]

The theory is very simple and intuitive. When you have a match (also called phrase or conjunction of bytes which have already been seen in the archive entered) in place of writing those bytes it writes the offset and length of the repetition: where it is and its length.

This is an algorithm based on a dictionary, because it maintains a dictionary (which in this case is known as a "Sliding Window") and it makes reference to it with offset/length pairs. This version of lz77, uses a sliding window, which has a maximum size, in that the window cannot be the complete archive, instead, the sliding window maintains the last bytes "seen".

Compression[edit]

Imagine compressing the text "ab ab". We read "ab", and write it without compressing it; then we read "ab" and write the following: with the "offset" of 0 there was a match of two repeated bytes.

Decompression[edit]

It is quite simple. First we read "ab " and then we copy the bytes from there like this:

Read ‘a’.  “a”
Read ‘b’.  “ab”
Read ‘ ‘.  “ab ”

Read offset/length. Copy two bytes from position 0. (“ab”) 
“ab ab”

How does it work?[edit]

But, how does the decompressor know what is an offset/length pair and what is an uncompressed byte? The answer is simple, we use a prefix, a bit that acts like a flag, like a switch with two states which allows us to know what type of data comes next.

If the prefix is 0, then what comes next is an uncompressed byte. If, on the other hand, the prefix is 1, then what comes next is an offset/length pair. We also call these prefixes "flags".

The offset/length pair is called a keyword. A keyword is a group of bits (or bytes) that contain some type of information used for the compressor and the decompressor. The other output possible from lz77 is a literal, which is simply an uncompressed byte, so that the output of lz77 can have three forms:

  1. Literals are simply uncompressed bytes.
  2. Keywords in our case are offset/length pairs.
  3. Flags simply indicate if the data that comes next is a literal or a keyword.

Now, for example, we see newly our string and a real output of an lz77 algorithm:

Read 'a'. No match.  Flag 0.  Literal 'a'.
Read 'b'. No match.  Flag 0.  Literal 'b'.
Read ' '. No match.  Flag 0.  Literal ' '.
Read 'a'  Match.     Flag 1.  Keyword: offset = 0, length = 2.

As you can see the flag only has two possible states, so we only need one bit to represent it. Now we will not need to represent the flags with complete bytes, we will need to work with bits. The output of this compression is called a bit stream, because it is a stream of symbols of variable size, and the minimal unit is the bit.

Sliding window[edit]

Looking at the previous example again, one may ask: Where do we look for the matches for a string? The answer is to look backwards, in the data that we have already processed. This is known as the sliding window. The sliding window is a buffer which maintains bytes before the current position in an archive. All the uncompressed bytes of the output (the literals) are added to the sliding window and also the bytes which form a match.

Let's see our example again:

Read 'a'. SW: "".    No match.  Flag 0.  Literal 'a'.
Read 'b'. SW: "a".   No match.  Flag 0.  Literal 'b'.
Read ' '. SW: "ab".  No match.  Flag 0.  Literal ' '.
Read 'a'  SW: "ab ". Match.     Flag 1.  Keyword: offset = 0, length = 2.

As you can see, when we look for matches, we compare the data that we have in our sliding window (SW) with the data at the current position.

Must we maintain a buffer with the data at the current position and a buffer with the sliding window? In some implementations this may be true, but the implementation that we teach here is not of this type; because together, the sliding window and the bytes in the current position, are not more than the archive. We only maintain one buffer, which contains the archive. Then we only need to consider the pointer at the current position, and the sliding window is just before this pointer. In fact it is recommended to have the complete archive (or at least a large block of it) and compress it, so we don't worry about reading more bytes.

Now let's talk about the sliding window. How large must it be? We can work with the complete archive, but we need to think about the offset necessary to specify the position or the match. This offset is not from the position 0 (the beginning of the archive) to the match, it is from the current position backwards. Thus in the example the offset must be 3 in position 0 (so, when decompressing, the decompressor obtains a three and subtracts it from the current offset). As you would expect, as the length of the window increases, we need more bits to store the pointer, so we must choose a size for the sliding window. 4096K is a frequently used size, but as the size of the sliding window increases, the compression gets better. Thus, the implementation can use any size and consider that if, for example, we select a length of 8192K, we need 13 bits for the offset.

Sizes[edit]

Another aspect that we must choose is the size of the length. So, how many bits should we use for the length? We can choose any size that we wish, but must consider that varying the number of bits for the length and the bits for the offset can improve compression in some archives and worsen it in others, so if we are designing a compressor for a unique archive, we must find the most appropriate values, on the other hand we will use a length of 0-32, of just 5 bits.

Another important aspect is the minimum length of a match. In our case we have elected to utilize 13 bits in the offset and 5 in the length, 18 bits in total, so a match will have to be, at least, 3 bytes. Because if we encode a match with two bytes (16 bits) and use 18 bits to store the string we are using 2 more bits than if we kept it as a literal.

Then another question arises. If we never have matches of 0, 1, or 2 bytes, then why have space for them in the length?

To obtain the greatest possible benefit. Our length, although it will be represented with 5 bits, its range will change from 0-32 to 3-35. How do we do this? We simply subtract from the length (before saving it) 3, and the decompressor will read it and add 3.

End marker[edit]

Now that we know how the decompression is done we need to note that the decompressor will need to know how to end or when to end. This can be accomplished in two ways:

  • A symbol that marks the end of the data. (A sentinel)
  • Comparing the string of bits with the length of the entire archive.

The second method may perhaps be preferable. It is a little slower, but at the same time as it is used to know the end of the data, it can be used in a possible interface and can help us to avoid some problems. Either way, if you want to use a data end marker we can use length zero for it. This is done in the following way:

  • The range will be from 3-34, in this case we need to subtract the length (when we add it) 2. Thus the range 1-32 is converted to 3-34, and the compressor only needs to bother with this at the compression of the archive, once the compression is finalized, the output offset/length has length of 0.
  • The decompressor then needs to verify every time that it reads a length if the length is 0, to know if it has reached the end of the archive.

Working with bits[edit]

As you can see, the offsets and lengths are of variable length, and the flags take just one bit, so we must use bits in place of bytes. This is very important in the majority of compression algorithms.

The majority of operations work with bytes and when writing information to a file the minimal unit is the byte, so, to write bits we make intelligent use of some instructions.

To accomplish this we can use ASM, or otherwise, it can be implemented in C. We will continue the operations with bits in ASM. The main idea is to keep a byte and a counter with the bits written, then when we have 8 bits, we write that byte to the file and begin anew with the next byte. Below is presented an example using instructions in ASM:

@@put_bits_loop:
push cx                ;the number of bits to write
mov bh,_byte_out       ;the output byte (where to write)
mov bl,_byte_in        ;the input byte (the bits to write)
mov al, bl             ;we store the byte to read from in al
shr al,1               ;we shift to the right al, first bit in the carry flag
xor ah, ah             ;put ah=0
adc ah,0               ;we add to ah 0 and the carry
mov bl, al             ;save the input byte
mov cl,_total_bits     ;the bits that we have writed
shl ah, cl             ;put the bit in his position by shifting it to the left
or bh, al              ;put the bit in the output byte
mov _byte_out,bh       ;guardarlo
inc _total_bits        ;the bits written
cmp _total_bits,8      ;Do we have write the whole byte?
jne @@no_write_byte    ;nop E-)
mov di,ptr_buffer      ;the pointer to the buffer
mov es:[di],bh         ;save the byte (es is the segment of the buffer)
inc di                 ;next byte in the buffer
mov ptr_buffer,di      ;save it for the next time
inc bytes_writed       ;when the buffer its full write it to
mov _byte_out,0        ;a file or something like that so the next time is clear
@@no_write_byte:       ;we saved it
pop cx                 ;more bits to write?
dec cx                 ;yes, repeat everything
jnz @@put_bits_loop

This presents the putbits routine, the names of the variables are explained above, but for completeness we present a description here:

  • _byte_out: the byte to be written in the output buffer, maintains the bits that we are currently writing.
  • _byte_in: the byte that contains the bits that we want to write.
  • _total_bits: the number of bits that have been writting so far, zero initially.
  • Ptr_buffer: pointer to the buffer

When the routine begins, cx must have the number of bits to write, and _byte_in the bits to write. One must be careful, after entering the loop to check whether cx is 0 because if it is zero and you do not test it will write a bit, then it will make cx - 1, which gives 255 leading to the writing of 255 bits.

Remember:

Test cx, cx
jz@@put_bits_end

This is the structure (as the bits are written) for a byte:

Bit 8Bit 7Bit 6Bit 5Bit 4Bit 3Bit 2Bit 1

When all the bits are written (the compression has finished) it must check if there are any bits waiting to be written, so if there are any (total_bits != 0) we write _byte_out, and increment all pointers so that we don't leave data unwritten.

Output file[edit]

Now we need to define the format of the output file, which will be simple, it just needs to contain our necessities, the compressed data will look like:

  • First a word with the size of the original file and if you want some numbers for identification.
  • Then the bitstream, which constitutes and contains the compressed data.

Pseudo Code[edit]