Purpose of File Compression

File compression is the process of reducing the size of computer files by encoding the same information more efficiently and with less redundancy. Its main purposes are:

Saving Disk Space – allows for more data to be stored on the storage device.
Faster Transmission – compressed files require less time and bandwidth to transmit over the internet, making them easier to distribute. It is also useful when dealing with file size limitations, such as email attachment limits (typically around 20–25 MB).
Organisation – businesses/individuals with large amounts of files or backup data can benefit greatly from compressing multiple files into a single archive, which saves disk space and makes the data easier to manage and keep organised.

Lossy

Lossy compression reduces the size of a file by permanently discarding data that is judged to be unnoticeable or redundant. This process is irreversible, meaning that the file cannot be fully restored to its original form after compression. It is commonly used for compressing images (JPEG), audio (MP3) and video (MP4), but not documents or text files, where discarded data would mean missing characters and could render the original text unreadable.

When compressing images, the algorithm aims to eliminate fine details that are not perceptible to the human eye. This process may involve blurring or smoothing high-frequency details, or grouping pixels with slight colour variations, which reduces the colour depth. (For further explanation of bit/colour depth, refer to the Common File Formats (Graphic) page.) As the compression ratio increases, the loss of detail and colour degradation become more noticeable and eventually result in visible artifacts (e.g. pixelation) in areas that previously contained fine details or sharp colour transitions.
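
The grouping of similar colours described above can be illustrated with a deliberately simplified sketch. It assumes 8-bit channel values (0 to 255) and simply keeps the most significant bits of each value; the function name is hypothetical, and real formats such as JPEG use considerably more sophisticated methods than this.

```python
def reduce_colour_depth(value: int, bits: int) -> int:
    """Keep only the top `bits` bits of an 8-bit channel value (0-255)."""
    if not 1 <= bits <= 8:
        raise ValueError("bits must be between 1 and 8")
    shift = 8 - bits
    # Dropping the low bits maps nearby shades onto one shared value.
    return (value >> shift) << shift


# Three slightly different shades become identical at 4 bits per channel.
shades = [200, 203, 207]
reduced = [reduce_colour_depth(v, 4) for v in shades]
print(reduced)
```

Once the three shades have been merged, nothing in the reduced data records which was which, and this is why the process cannot be reversed.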

In audio file compression, the algorithm uses perceptual audio coding to analyse audio signals and discard high-frequency sounds that are beyond the range of human hearing. Additionally, psychoacoustics is employed to identify sounds that are masked by louder sounds and thus inaudible to the human ear (this has also been explained in the Common File Formats (Audio) page).

Videos can be compressed using similar methods to images, where details and colour depth are reduced within each frame. However, videos also contain redundancy between frames. In a scene where little changes from one frame to the next (e.g. a stationary subject with only minor changes in lighting), the algorithm can store one complete frame and then record only the differences in the frames that follow. This technique is called interframe compression, and motion compensation extends it by describing how parts of the picture move between frames. Separately, the frame rate, measured in frames per second (FPS), can be lowered by discarding frames, although this can reduce the smoothness or fluidity of motion in the video.
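
The idea of storing only what changes between frames can be sketched as follows. This is a toy illustration, assuming each frame is a flat list of pixel values of equal length; the function names are hypothetical, and real video codecs work on blocks of pixels with motion vectors and further lossy steps that are not shown here.

```python
def frame_delta(previous: list, current: list) -> dict:
    """Return {pixel index: new value} for pixels that differ between frames."""
    return {
        i: new
        for i, (old, new) in enumerate(zip(previous, current))
        if old != new
    }


def apply_delta(previous: list, delta: dict) -> list:
    """Rebuild the next frame from the previous frame plus the stored changes."""
    frame = list(previous)
    for i, value in delta.items():
        frame[i] = value
    return frame


frame_1 = [10, 10, 10, 200, 200, 10]
frame_2 = [10, 10, 10, 10, 200, 200]  # the bright area has shifted right

delta = frame_delta(frame_1, frame_2)
print(delta)                        # only the changed pixels are stored
print(apply_delta(frame_1, delta))  # reproduces frame_2
```

The more similar two consecutive frames are, the smaller the stored difference becomes.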

Lossless

Lossless compression is reversible: the original data can be rebuilt exactly from the compressed data during decompression. This is achieved by identifying patterns and redundancies in the data and then rewriting it more efficiently using methods such as indexing, run-length encoding (RLE) and Huffman coding. Since lossless compression does not discard any data, it generally cannot reduce the file size as much as lossy compression. However, it is useful in cases where it is critical to maintain the exact quality, accuracy and integrity of the original data without any degradation (e.g. medical imaging, court trial evidence, business info). Some common lossless file formats include PNG, GIF and ZIP. BMP and WAV also preserve the original data, although they usually do so by storing it uncompressed rather than by compressing it, and RAW formats vary by camera manufacturer.
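
A quick way to see this reversibility is to compress and decompress some data and compare the result with the original. The sketch below uses Python's standard zlib module as one example of a lossless compressor; the sample data and the helper name are made up for illustration.

```python
import zlib


def round_trip(data: bytes) -> tuple:
    """Compress the data, decompress it again, and return both results."""
    compressed = zlib.compress(data, 9)  # 9 = highest compression level
    restored = zlib.decompress(compressed)
    return compressed, restored


# Highly repetitive sample data compresses well.
original = b"AB" * 128
compressed, restored = round_trip(original)

print(len(original), "bytes before,", len(compressed), "bytes after")
print("identical after decompression:", restored == original)
```

Data with little repetition (such as files that are already compressed) will shrink far less, and can even grow slightly.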

Indexing

One way the algorithm can encode the data more efficiently is through indexing. This technique creates an index of repeating patterns within the data and represents them in a shorter form. The index table is then used as a reference during decoding/decompression to reconstruct the original data.

For instance, consider the following sentence: “Education is the kindling of a flame, not the filling of a vessel” (often attributed to Socrates).

This can be compressed by substituting repeating words, or even sequences within individual words, with abbreviated representations, thus becoming: “Education is ‡ kind§ ¤ a flame, not ‡ fil§ ¤ a vessel.”

And the index table for this would be:

Original = Representation

the = ‡

-ling = §

of = ¤
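
The substitution above can be sketched in a few lines of code. This is a minimal illustration that uses plain text replacement and assumes the symbols ‡, § and ¤ never occur in the original text; the function names are hypothetical.

```python
# Index table from the example: original pattern -> shorter representation.
INDEX = {"the": "‡", "ling": "§", "of": "¤"}


def encode(text: str, index: dict) -> str:
    """Replace every indexed pattern with its shorter symbol."""
    for pattern, symbol in index.items():
        text = text.replace(pattern, symbol)
    return text


def decode(text: str, index: dict) -> str:
    """Use the same index table to restore the original patterns."""
    for pattern, symbol in index.items():
        text = text.replace(symbol, pattern)
    return text


sentence = "Education is the kindling of a flame, not the filling of a vessel"
encoded = encode(sentence, INDEX)
print(encoded)
print(decode(encoded, INDEX) == sentence)
```

Note that the index table has to be stored alongside the encoded text, so this approach only pays off when the patterns repeat often enough to outweigh the size of the table.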

Run-length Encoding (RLE)

Run-length encoding is a technique that replaces a repeated sequence with a single data value and a count of how many times it appears consecutively. It is particularly useful in images where there are consecutive pixels with the same colour value. To illustrate, consider a binary image consisting of three rows of binary code (with “0” representing white and “1” representing black):

“00011100101110

11000001000110

00010011110000”

Instead of storing the colour value of every pixel individually, RLE shortens this code by recording how many pixels of the same colour appear consecutively, alternating between white and black:

“3,3,2,1,1,3,1 (i.e. three “0”/white, followed by three “1”/black, then two “0”/white, etc.)

0,2,5,1,3,2,1

3,1,2,4,4”

(Notice that the 2nd row begins with zero even though the original binary code starts with two “1”/black pixels. This is because, in the convention used here, the first number of each row always counts white pixels, so a zero is needed to indicate that there are no white pixels at the start.)
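
The same convention can be written as a short sketch, shown here on the second and third rows of the example. It assumes each row is a string of “0” and “1” characters and that every row starts by counting white pixels; the function names are hypothetical.

```python
def rle_encode_row(row: str) -> list:
    """Encode a row of '0'/'1' pixels as alternating run lengths, white first."""
    runs = []
    current = "0"  # by convention, the first count is always white
    count = 0
    for pixel in row:
        if pixel == current:
            count += 1
        else:
            runs.append(count)  # close the finished run (may be 0 at the start)
            current = pixel
            count = 1
    runs.append(count)
    return runs


def rle_decode_row(runs: list) -> str:
    """Rebuild the row: even positions are white runs, odd positions black."""
    return "".join(("0" if i % 2 == 0 else "1") * n for i, n in enumerate(runs))


row_2 = "11000001000110"
row_3 = "00010011110000"
print(rle_encode_row(row_2))  # starts with 0 because the row begins with black
print(rle_encode_row(row_3))
print(rle_decode_row(rle_encode_row(row_2)) == row_2)
```

Because decoding reproduces the row exactly, this is a lossless method. It works best on long runs of a single colour; a row that alternates on every pixel would produce more numbers than it had pixels.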

Image: Lossy vs. Lossless Compression, Image by PCMag (https://www.pcmag.com/encyclopedia/term/lossy-compression)

When deciding which compression method to use, it is important to strike the correct balance between:

Storage & Delivery Limitations – how much space your device has, and the file size limit when sharing content.
Loading Time – this not only affects user experience on the web, but it also plays a role in search engine optimisation (SEO), as Google treats page speed as one of the factors when ranking search results.
Quality – there is always a trade-off between file size & quality.

File Sizes

The amount of space a file occupies on a storage device is referred to as its file size. This is determined by how much data the file contains, which can be influenced by factors such as the compression method used, rate of compression, and file format. In the case of images, file size is primarily affected by the number of pixels and the colour depth, as each pixel stores a colour value. As the amount of data increases, so does the file size, and subsequently, the loading and transfer time as well.
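
As a rough worked example of how pixel count drives file size, the sketch below estimates the size of an image's raw pixel data. It assumes an uncompressed image with a fixed number of bits per pixel and ignores headers and metadata; the function name is hypothetical.

```python
def uncompressed_image_size(width: int, height: int, bit_depth: int) -> int:
    """Approximate size in bytes of raw pixel data (no header, no compression)."""
    total_bits = width * height * bit_depth
    return (total_bits + 7) // 8  # round up to whole bytes


full_hd = uncompressed_image_size(1920, 1080, 24)
half_size = uncompressed_image_size(960, 540, 24)

print(full_hd, "bytes at 1920 x 1080")
print(half_size, "bytes at 960 x 540")
```

Halving both the width and the height leaves a quarter of the pixels, and therefore roughly a quarter of the data, before any compression is applied.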
