Information Theory

A quick introduction to compression

When we store or transmit data, it makes sense to try to reduce the storage space or the transmission delay. We are looking for two algorithms: one that encodes a chunk of data into a smaller chunk, and a second one that reconstructs (perfectly or not) the initial, larger chunk from the compressed data.

When we can reconstruct the data exactly, we are performing lossless compression; otherwise we speak of irreversible, or lossy, compression.

We start with an introduction based on simple methods.

Image compression

If the data repeats some symbols locally, for example long runs of white pixels in a black-and-white image, we may save a lot of space by describing each run with a number giving its length together with the repeated symbol.

This so-called run-length encoding was used in early image file formats, before the GIF format appeared. The GIF format itself uses another compression method, known as LZW.
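As a minimal sketch of run-length encoding, assume a row of pixels is written as a string of `W` (white) and `B` (black) characters. This representation and the function names `rle_encode` and `rle_decode` are chosen for illustration only; they do not describe any particular file format. Each run becomes a (length, symbol) pair:

```python
from itertools import groupby


def rle_encode(symbols):
    """Return a list of (run_length, symbol) pairs for the input sequence."""
    return [(len(list(run)), symbol) for symbol, run in groupby(symbols)]


def rle_decode(pairs):
    """Rebuild the original string from (run_length, symbol) pairs."""
    return "".join(symbol * length for length, symbol in pairs)


# Hypothetical row of a black-and-white image: W = white, B = black.
row = "W" * 12 + "B" + "W" * 12 + "BBB"
encoded = rle_encode(row)
print(encoded)  # [(12, 'W'), (1, 'B'), (12, 'W'), (3, 'B')]

# Lossless: decoding gives back exactly the original row.
assert rle_decode(encoded) == row
```

The gain depends entirely on the presence of long runs: on a row that alternates symbols at every pixel, this encoding would be longer than the input.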

In contrast with the above, where there is no loss of information, let us cite the JPG format, which aims at compressing data with loss of information; when it is performed adequately, the human eye cannot detect the loss.

If you are interested in patent controversies, which seem to be an American sport, there are historical examples with both the GIF and JPG formats.

A digression: vector graphics formats.

So far we have mentioned only so-called raster graphics formats, i.e. rectangles of pixels. There is an alternative, in particular for artificial images: icons, diagrams, maps, etc.

In a vector format, rather than describing the image pixel by pixel, the file describes, in a mini-language, a recipe explaining how to draw it. In particular, this means that the image can be zoomed on screen to arbitrary precision.

One format built around this idea is the SVG format, now a standard on the web. An SVG file looks similar to an HTML file (it is an XML file). Just like HTML, this XML text format can be manipulated by scripts via the so-called Document Object Model (DOM), for example to highlight part of the image when something is selected.
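To make the idea of a drawing recipe concrete, here is a small sketch that holds a hand-written SVG document in a Python string and checks that a generic XML parser can read it. The shapes, sizes and colours are made up for illustration and are not taken from the course files.

```python
import xml.etree.ElementTree as ET

# A made-up SVG picture: a white background, a black disc and a short label.
svg_text = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">
  <rect x="0" y="0" width="200" height="100" fill="white"/>
  <circle cx="50" cy="50" r="30" fill="black"/>
  <text x="100" y="55" font-size="16">Hello</text>
</svg>"""

# An SVG file is an XML file, so a generic XML parser can read it.
root = ET.fromstring(svg_text)

# List the drawing instructions (the "recipe"), stripping the XML namespace.
shapes = [child.tag.split("}")[1] for child in root]
print(shapes)  # ['rect', 'circle', 'text']
```

Saving the content of `svg_text` in a file with the .svg extension and opening it in a web browser should display the picture; the file stays the same size whatever the zoom level.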

Instead of writing SVG code by hand, you may draw SVG pictures with a tool such as Inkscape.

An experiment: size does matter.

We consider several files in the folder jokeTextInImage.

Save them on a computer where you have access to a terminal with basic Linux commands.

Use the ls command with the appropriate options to find out how much space the data requires.

NB: the jpg file is the original "joke" I found on the web. The txt, svg and html files were crafted by hand to provide the same information content (txt) and to mimic the displayed format (svg, html). The other two png files were obtained by using the print screen option.

ls -Shl

The manual explains the role of the options.

man ls

The following variant gives larger sizes (the actual space used on disk): the file system allocates space in blocks of a certain minimal size (probably at least 4 KiB), and the figure may also cover some bookkeeping data, not just the data itself.

ls -sSh
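The difference between the two sizes can also be observed from a program. The sketch below assumes a Unix-like system, where `os.stat` exposes the field `st_blocks`, documented by Python as a number of 512-byte blocks. The helper name `sizes` and the temporary file are illustrative only.

```python
import os
import tempfile


def sizes(path):
    """Return (data size in bytes, space allocated on disk in bytes)."""
    info = os.stat(path)
    # st_size is the length of the data; st_blocks counts the 512-byte
    # units allocated by the file system (Unix-like systems only).
    return info.st_size, info.st_blocks * 512


# Write a tiny 10-byte file and compare the two figures.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"0123456789")
    path = handle.name

data_size, allocated = sizes(path)
print(data_size, allocated)  # the second value depends on the file system
os.remove(path)
```

The first number is the one reported by `ls -l`; the second corresponds to what `ls -s` reports, and its exact value varies from one file system to another.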

Hang on: what does the size mean here? What is the actual unit?

Byte

In French, the corresponding word is "octet" (8 bits). For historical reasons, a byte may mean a different number of bits, e.g. 6, but in practice it is now uniformly understood as 8 bits.

We can compute the size of the text file without too much trouble: we just have to count the number of characters. We already discussed ASCII, and we know that we need 7 bits, and in practice 8 bits, for a basic character as used in English-speaking countries.
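A small sketch of this count, using a made-up sample string rather than the joke file: every ASCII character has a code below 128, so it fits in 7 bits and is stored in one 8-bit byte.

```python
# Made-up sample text (not the content of the joke file).
text = "Hello, world!"

for char in text[:5]:
    code = ord(char)  # ASCII code of the character
    print(char, code, format(code, "07b"))  # 7 binary digits are enough

# All codes are below 2**7 = 128, so the text is pure ASCII.
assert all(ord(char) < 128 for char in text)

size_in_bytes = len(text)  # one byte per ASCII character
size_in_bits = 8 * size_in_bytes
print(size_in_bytes, size_in_bits)  # 13 104
```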

The following command can help us with the counting (here, the manual of the wc command).

man wc

Number of bytes

wc -c joke_as_a_text_file.txt

Number of characters

wc -m joke_as_a_text_file.txt

The two counts are the same here.

They also coincide with the data size given by the ls command.
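The byte count and the character count agree because the file contains only basic ASCII characters. As an illustration, assuming the UTF-8 encoding and using made-up sample strings (the helper name `count_chars_and_bytes` is also hypothetical), an accented character takes more than one byte, so the equivalents of `wc -m` and `wc -c` would differ:

```python
def count_chars_and_bytes(text, encoding="utf-8"):
    """Return (number of characters, number of bytes once encoded)."""
    return len(text), len(text.encode(encoding))


# ASCII only: one byte per character.
print(count_chars_and_bytes("octet"))  # (5, 5)

# "caf\u00e9" is the word café; its accented letter takes 2 bytes in UTF-8.
print(count_chars_and_bytes("caf\u00e9"))  # (4, 5)
```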

We can reproduce the experiment with the html and svg files (they are also text files) with the same result.

For the jpg file it is harder to understand why the image has this data size.

A first concrete example of compression: images and run-length encoding

Unplugged computer science activity, IREM Clermont

Variable-length encoding: Huffman code.

Huffman

Archive

tar

Backup

rsync