Math in… Data Compression

Big files take a lot of space to store and a long time to download, so mathematicians and computer scientists have come up with many approaches to data compression.

The goal with data compression is to store a data set as a smaller amount of data. The rub is that that smaller data set somehow needs to contain enough info that it can get turned back into the original data when decompressed.

 
 

If the file is a program, it needs to be the exact same file before compression and after decompression or else it won’t run. This requires lossless compression.

A common lossless compression algorithm is LZW (Lempel–Ziv–Welch). The basic idea is that, if the data is a string of characters, some substrings might be more common than others. (In English, for example, 3-letter strings like THE or AND will be more common than strings like FZE and QQS.)

LZW starts with a list of characters and chunks the data according to the list. Each time it chunks a maximal already-listed substring, it adds a longer substring including the next character. Can you see how this works in the example below?

 

{A,B,C,H,N,O,R,W,HO}

HOWNOWBROWNCOWABROWNCOWHOWNOW

 

{A,B,C,H,N,O,R,W,HO,OW,WN,NO,OWB}

HOWNOWBROWNCOWABROWNCOWHOWNOW

 

{A,B,C,H,N,O,R,W,HO,OW,WN,NO,OWB,BR,RO,OWN,NC,CO,OWA,AB,BRO,OWNC,COW,WH,HOW,WNO}

HOWNOWBROWNCOWABROWNCOWHOWNOW

 

LZW keeps a table of shorthands for substrings that can be used to compress and decompress the data. Here’s a simplification of how that works:

A: a
B: b
C: c
H: d
N: e
O: f
R: g

W: h
HO:
i
OW:
j
WN:
k
NO:
l
OWB:
m
BR:
n

RO: o
OWN: p
NC: q
CO: r
OWA: s
AB: t
BRO: u

OWNC: v
COW: w
WH: x
HOW: y
WNO: z

original: HOWNOWBROWNCOWABROWNCOWHOWNOW

dfhej bgj q j an p cj i k j

compressed: dfhejbgjqjanpcjikj

The table has a set size to keep it small relative to the data, so at some point new shorthands are no longer created. The remaining data is compressed based on what’s already in the table.

LZW and its variants are popular in compressing formats like ZIP, RAR, GIF, TIFF, and PDF files.

Since digital photos and videos are already approximations of analog photos and video, it is not as important that they decompress exactly to their original forms. In lossy compression, an irreversible step is used.

GIFs are made by reducing the number of colors to 256 before using LZW to catalog the pixels. Color reduction is an irreversible step, so palettes of GIFs might seem “off.”

JPEGs use a discrete cosine transform (DCT) followed by an irreversible quantization step on blocks of the image. This can result in choppiness at borders of neighboring blocks called compression artifacts.

How are the files you use compressed?

Nick Rauh

Nick is a Seattle-based mathematician who has spent his career teaching at colleges and designing math activities for K-12 children. He is currently the Mathematician in Residence at the Seattle Universal Math Museum.

https://maththem.blogspot.com
Previous
Previous

Math in… Ranked Choice Voting

Next
Next

Math in… Misleading Graphs