Visualizing Entropy in Binary Files

Sep 06, 2017

A lot of the data we store is redundant. We tell our computers to keep a record of the same information over and over. Compression algorithms are an attempt to avoid that waste.

Using Python's zlib module, we can see how a human-readable pattern is turned into a noisy string of bytes:

import zlib
import base64

uncompressed = "ab" * 1024  # 2 KiB of abababab...abababab
# Compress the UTF-8 bytes, then base64-encode the result so it is printable
compressed = base64.b64encode(zlib.compress(uncompressed.encode('utf-8'))).decode('utf-8')

print(compressed)  # eJxLTEochaNwFI7CUTgKR+EIgwBC7gwu

Some compression algorithms use domain-specific knowledge to reduce the number of bytes needed to store something. To remove repetition in a way that is agnostic to the type of file being compressed, we need to look at data at the byte level. However, witnessing the compression usually comes down to nothing more than seeing a file's size decrease. You don't normally get to see the repetitiveness of the uncompressed file, or the noise of the compressed file.

But if you naively interpret the bytes of a file as pixels in an image, you can get a feel for what's happening to the file when it's compressed.
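To make "bytes as pixels" concrete, here is a minimal sketch that wraps raw bytes in a binary PGM (grayscale) header, using one byte per pixel. This is not necessarily how the images in this post were produced; the function name `bytes_to_pgm` and the default width are assumptions made for illustration.

```python
import math

def bytes_to_pgm(data: bytes, width: int = 256) -> bytes:
    """Render raw bytes as a binary PGM image, one byte per grayscale pixel."""
    if width <= 0:
        raise ValueError("width must be positive")
    height = max(1, math.ceil(len(data) / width))
    # Pad the final row with zeros (black) so the pixel grid is rectangular.
    padded = data.ljust(width * height, b"\x00")
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + padded

# The repetitive string from earlier becomes a 64x32 image of vertical stripes.
image = bytes_to_pgm(b"ab" * 1024, width=64)
```

Writing `image` to a file opened in `"wb"` mode gives something most image viewers can open. Because the width is even, every row starts on an `a`, which is why the pattern shows up as stripes.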

Here is an excerpt from a 2-channel WAV file as it looks when converted into an image (with 1 bit of audio data per pixel).

WAV file as an image

It's actually possible to see the two separate channels of the WAV file. Some areas of this excerpt look noisy, but there's a lot of repetition: for example, the black bar in the middle of the image.

After compressing the WAV file as an MP3, it's apparent that something has changed:

MP3 file as an image

This pretty much looks like static. The information entropy in this file is much higher per byte on disk.
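One rough way to put a number on this is the Shannon entropy of the byte-value histogram. The sketch below is only a proxy: it ignores the order of bytes, so it understates how predictable a pattern like `abab...` really is. The sample data is synthetic (random words from a small, made-up vocabulary), not the WAV or MP3 shown here, and `entropy_bits_per_byte` is a name chosen for this example.

```python
import math
import random
import zlib
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte-value distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Two equally likely byte values carry exactly one bit per byte.
print(entropy_bits_per_byte(b"ab" * 1024))  # 1.0

# Synthetic, reproducible sample: words drawn from a small vocabulary.
rng = random.Random(0)
words = [b"entropy", b"binary", b"file", b"image", b"pixel", b"noise", b"byte", b"data"]
raw = b" ".join(rng.choice(words) for _ in range(20000))
packed = zlib.compress(raw)

print(f"raw:        {len(raw)} bytes, {entropy_bits_per_byte(raw):.2f} bits/byte")
print(f"compressed: {len(packed)} bytes, {entropy_bits_per_byte(packed):.2f} bits/byte")
```

The compressed output is smaller, and its byte histogram is much closer to uniform, which is what "looks like static" means in practice.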

The same phenomenon can be observed in image files. Here's an uncompressed BMP image.

Uncompressed source image

Interpreting the image as raw bytes yields this visualization (cropped, 24 bits of image data per pixel).

BMP image bytes visualized
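At 24 bits per pixel, every three consecutive bytes of the file become one pixel. A small sketch of that grouping follows; the (R, G, B) channel order and the zero padding of a trailing partial pixel are assumptions for this example, and the helper name is hypothetical.

```python
def bytes_to_rgb_pixels(data: bytes) -> list:
    """Group raw bytes into (R, G, B) triples, i.e. 24 bits per pixel."""
    # Pad with zero bytes so the length is a multiple of 3.
    padded = data + b"\x00" * (-len(data) % 3)
    return [tuple(padded[i:i + 3]) for i in range(0, len(padded), 3)]

pixels = bytes_to_rgb_pixels(b"\xff\x00\x00\x00\xff\x00\x12")
print(pixels)  # [(255, 0, 0), (0, 255, 0), (18, 0, 0)]
```

The seventh byte has no partners, so it is padded out to a dark red pixel on its own.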

And when compressed as a PNG image…

PNG image bytes visualized

Again, this image looks like pure noise (with the exception of the header and footer data). Interestingly, this visualization has significantly smaller dimensions than the one for the original BMP image. The image's size is visually representative of the file's size.
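The link between file size and image dimensions is simple arithmetic. Assuming a fixed image width and three bytes per pixel (both assumptions for this sketch, as are the file sizes used), the height grows in proportion to the number of bytes:

```python
def image_height(file_size: int, width: int, bytes_per_pixel: int = 3) -> int:
    """Number of pixel rows needed to show file_size bytes at a fixed width."""
    # Integer ceiling division, so a partial pixel or row still counts.
    pixels = (file_size + bytes_per_pixel - 1) // bytes_per_pixel
    return (pixels + width - 1) // width

# Hypothetical file sizes, chosen only for illustration.
print(image_height(3_000_000, width=500))  # 2000
print(image_height(600_000, width=500))    # 400
```

A file compressed to a fifth of its original size renders as an image a fifth as tall.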

The images in this post were rendered with binimage, a tool I wrote.

Tagged in : programming