Visualizing Entropy in Binary Files
A lot of the data we store is redundant. We tell our computers to keep a record of the same information over and over. Compression algorithms are an attempt to avoid that waste.
Using Python's zlib module, we can see how a human-readable pattern is turned into a noisy string of bytes:
import zlib
import base64
uncompressed = "ab" * 1024 # 2KiB of abababab...abababab
compressed = base64.b64encode(zlib.compress(uncompressed.encode('utf-8'))).decode('utf-8')
print(compressed) # eJxLTEochaNwFI7CUTgKR+EIgwBC7gwu
Some compression algorithms use domain-specific knowledge to reduce the number of bytes needed to store something. General-purpose ones can't: to remove repetition in a way that is agnostic to the type of file being compressed, they have to look at the data at the byte level. Either way, witnessing the compression usually comes down to nothing more than seeing a file's size decrease. You don't normally get to see the repetitiousness of the uncompressed file, or the noise of the compressed file.
But if you naively interpret the bytes of a file as pixels in an image, you can get a feel for what's happening to the file when it's compressed.
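As a rough illustration of that idea, here is a minimal sketch that maps each byte of some data to one grayscale pixel and wraps the result in a binary PGM header, using only the standard library. It is not necessarily how the images in this post were produced; the function name `bytes_to_pgm` and the fixed `width` parameter are assumptions made for the example.

```python
import math
import zlib


def bytes_to_pgm(data: bytes, width: int = 256) -> bytes:
    """Render raw bytes as a binary PGM (P5) image, one byte per pixel."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Always emit at least one row, even for empty input.
    height = max(1, math.ceil(len(data) / width))
    # Pad the last row with black pixels so the image is a full rectangle.
    pixels = data.ljust(width * height, b"\x00")
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + pixels


# The same 2 KiB pattern as above, before and after zlib compression.
raw = b"ab" * 1024
raw_image = bytes_to_pgm(raw, width=64)
packed_image = bytes_to_pgm(zlib.compress(raw), width=64)
```

Writing `raw_image` and `packed_image` to `.pgm` files and opening them in an image viewer should show regular stripes for the raw pattern and a single short row of speckle for the compressed data.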
Here is an excerpt from a 2-channel WAV file as it looks when converted into an image (with 1 bit of audio data per pixel).
It's actually possible to see the two separate channels of the WAV file. Some areas of this excerpt look noisy, but there's a lot of repetition. For example, look at the black bar in the middle of the image.
After compressing the WAV file as an MP3, it's apparent that something has changed.
This pretty much looks like static. The information entropy in this file is much higher per byte on disk.
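To put a rough number on "entropy per byte", one option is the Shannon entropy of the byte-value histogram, measured in bits per byte (0 for a file made of a single repeated byte, 8 for perfectly uniform bytes). The sketch below is an assumption about how one might measure it, not something taken from the tool used for these images, and it uses the small zlib example from earlier rather than the WAV and MP3 files.

```python
import math
import zlib
from collections import Counter


def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte-value distribution, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # iterating bytes yields ints in 0..255
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


raw = b"ab" * 1024
packed = zlib.compress(raw)

raw_entropy = entropy_bits_per_byte(raw)        # two equally likely byte values
packed_entropy = entropy_bits_per_byte(packed)  # many distinct byte values
print(f"raw: {raw_entropy:.2f} bits/byte, compressed: {packed_entropy:.2f} bits/byte")
```

This measure only counts how often each byte value occurs and ignores ordering, so it understates how predictable a pattern like abab... really is. It is still enough to show the compressed bytes being spread over far more values than the raw ones.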
The same phenomenon can be seen in image files. Here's an uncompressed BMP image.
Interpreting the image as raw bytes yields this visualization (cropped, 24 bits of image data per pixel).
And when compressed as a PNG image…
Again, this image looks like pure noise (with the exception of the header and footer data). Interestingly, it also has significantly smaller dimensions than the visualization of the original BMP image. The size of the rendered image is visually representative of the file's size.
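The link between file size and picture size is simple arithmetic: at a fixed width and 24 bits (3 bytes) per pixel, each row of the rendering consumes a fixed number of bytes, so a smaller file needs fewer rows. Here is a small sketch of that calculation. The file sizes and the width of 256 are made-up values for illustration, not measurements of the images above, and `rendered_height` is a hypothetical helper.

```python
import math


def rendered_height(file_size: int, width: int, bytes_per_pixel: int = 3) -> int:
    """Number of pixel rows needed to draw file_size bytes at a fixed width."""
    bytes_per_row = width * bytes_per_pixel
    return math.ceil(file_size / bytes_per_row)


# Hypothetical sizes: a 1,000,000-byte BMP and a 250,000-byte PNG.
bmp_rows = rendered_height(1_000_000, width=256)
png_rows = rendered_height(250_000, width=256)
print(bmp_rows, png_rows)  # 1303 326
```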
The images in this post were rendered with binimage, a tool I wrote.