YOGA - A Handmade Lossless
Image Compression Format



Table of contents:
Foreword
Loading the image
Converting to YCoCg-R
Analysing compressibility
Runlength encoding
Huffman coding
Implementing a look-up table
Writing header data
Writing encoded data
Decoding the image
Analysing performance
Final words

Source on GitLab! | Yogaviewer | Kodak example files

Foreword

YOGA is a lossless raster graphics file format written in C++ as a single header file, made solely by me in 2023. I decided to make it as an exercise to learn how compression and decompression work, with the express goal of being fast while keeping a compression ratio similar to PNG.

The results are quite impressive. YOGA is on average 5-10x faster at encoding and about 1.2-2x faster at decoding than libpng while keeping a similar compression ratio. Where YOGA shines is compressing larger, detailed images with realistic lighting, like photographs and renders, where it's not only faster but also creates about 30% smaller files. However, it performs much worse on small images like icons and flat-shaded art. All of the data and more can be found in the Analysing performance section.

When I was researching the topic I found it difficult to find clear and concise explanations of how to implement compression and decompression algorithms in a practical way. While there are plenty of resources available on how e.g. Huffman coding works on a theoretical level, there's not as much freely available material for programmers who want to make something useful out of it. For that reason I decided to write this article about the steps I took, to provide the material I wish I'd had when making it.


Loading the image

Before we begin, it's incredibly useful to know as much as possible about the data we want to compress. Uncompressed images come in a variety of colour depths, typically 4-12 bits per channel across 1-4 channels, with the channel values stored together in memory as a "pixel". Images using floating point values do exist, but they're usually reserved for image editing, rendering or specific purposes like HDR photography.

Since memory is 1-dimensional, image data stored there is too. Most image formats store pixels in rows, typically starting from the top (others do it in blocks or chunks). Some image formats like BMP allow negative heights to tell the image reader that the rows are stored from the bottom up.

YOGA is limited to a maximum of 8 bits per colour channel and 4 channels total, but thanks to the compression technique we're using we get the implementation of lower colour depths and fewer channels for free! The width and height will be stored in int32s, which is still enough to support a terapixel-sized image, but we will limit the total size of a compressed image to 2 GiB. All of the functions are single-threaded, so if you need larger images it makes more sense to combine multiple images on different threads.

A common approach to compressing images is to convert the red, green and blue (RGB) colours to a different colour space and discard information that the human eye is less sensitive to, called chroma subsampling. There are however several colour transformations that translate back to RGB exactly, keeping the compression lossless, while still allowing the data to be more compressible by reducing randomness.

To write the library we will begin with a colourfully diverse image known as kodim23 from the Kodak Lossless True Colour Image Suite. The single line of black pixels at the bottom has been removed but it's otherwise unedited.

kodim23

Our first task is to read the PNG image, and for this we will use an existing library such as stb_image, lodepng or libpng. Whichever one we choose, we write the data to a struct like this:

struct YOGA_Bitmap_Image {
    int32_t width;
    int32_t height;
    int32_t channels;
    uint8_t* data;
};
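For illustration, with stb_image the load could look like this (a minimal sketch with error handling omitted; the helper name is mine, not part of YOGA):

#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"

// Load a PNG with stb_image, forcing 3 channels (RGB).
// stbi_load returns a malloc'd buffer of width * height * 3 bytes.
YOGA_Bitmap_Image Load_Image(const char* path) {
    YOGA_Bitmap_Image image = {};
    int width = 0, height = 0, channels_in_file = 0;
    image.data = (uint8_t*)stbi_load(path, &width, &height, &channels_in_file, 3);
    image.width = width;
    image.height = height;
    image.channels = 3;
    return image;
}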

Converting to YCoCg-R

Once the image is loaded we can convert it to a more compressible format. The one that will be discussed here is YCoCg-R. This colour model separates the brightness (Y) of each pixel into an 8-bit unsigned channel and the colour data into two 9-bit signed channels: chroma orange (Co) and chroma green (Cg). (This is where the name "YOGA" comes from, since we're adding an alpha (A) channel to this.) The conversion is losslessly reversible and can be done quickly with the following simple operations:

void RGB_To_YCoCgR(uint8_t r, uint8_t g, uint8_t b,
                   uint8_t* y, int16_t* co, int16_t* cg)
{
    *co = (r - b) & 0x1FF;
    int16_t tmp = b + (*co >> 1);
    *cg = (g - tmp) & 0x1FF;
    *y = (tmp + (*cg >> 1)) & 0xFF;
}

void YCoCgR_To_RGB(uint8_t y, int16_t co, int16_t cg,
                   uint8_t* r, uint8_t* g, uint8_t* b)
{
    uint32_t tmp = y - (cg >> 1);
    *g = (uint8_t)(cg + tmp);
    *b = (uint8_t)(tmp - (co >> 1));
    *r = (uint8_t)(*b + co);
}

With YCoCgR_To_RGB we can generate the entire colour space for each value of Y. Since we are masking the values the pattern repeats for combinations of YCoCg that couldn't be generated by RGB_To_YCoCgR.
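As a quick sanity check, a brute-force round trip over every 24-bit RGB triple verifies that the conversion really is lossless. This is a throwaway test, not part of the library:

#include <cstdint>

// Verify that every RGB triple survives the YCoCg-R round trip exactly.
int main() {
    for(int32_t r = 0; r < 256; r++)
    for(int32_t g = 0; g < 256; g++)
    for(int32_t b = 0; b < 256; b++) {
        uint8_t y; int16_t co, cg;
        RGB_To_YCoCgR((uint8_t)r, (uint8_t)g, (uint8_t)b, &y, &co, &cg);
        uint8_t r2, g2, b2;
        YCoCgR_To_RGB(y, co, cg, &r2, &g2, &b2);
        if(r != r2 || g != g2 || b != b2) return 1; // should never happen
    }
    return 0; // all ~16.7 million triples round-tripped exactly
}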


Analysing compressibility

Another reason why we pick YCoCg-R is that the conversion is very fast and uses only simple integer math. The fact that we end up with more bits than RGB is okay, because the final result is still likely to have a lot less entropy and compress better. To verify this there is a method that scales linearly with the image size to analyse the total entropy of the image. Let's start by separating the RGB data into its component channels:

// NOTE: the channel data is expanded to 16-bit integers here
// for reasons that will become clear later
int16_t* channel_r = params->channels[0].channel;
int16_t* channel_g = params->channels[1].channel;
int16_t* channel_b = params->channels[2].channel;
for(int32_t i = 0; i < params->pixel_count; i++) {
    channel_r[i] = bmp_image_data[i * 3 + 0];
    channel_g[i] = bmp_image_data[i * 3 + 1];
    channel_b[i] = bmp_image_data[i * 3 + 2];
}

Next we create a struct to gather occurrence statistics for each possible value of the colours into an array that will allow us to plot them.

struct TFLI_YOGA_Occurrences {
    int32_t value;
    int32_t occurrences;
};
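Filling the array is then a single pass over the channel; a minimal sketch for the red channel, assuming the 16-bit channel data from above (the exact bookkeeping in YOGA may differ):

// Count how many times each possible value occurs in the red channel.
TFLI_YOGA_Occurrences occurrences_r[256] = {};
for(int32_t v = 0; v < 256; v++) {
    occurrences_r[v].value = v;
}
for(int32_t i = 0; i < params->pixel_count; i++) {
    occurrences_r[channel_r[i]].occurrences++;
}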

You can still make out a lot of detail in each colour channel, as they also encode the brightness. This can be seen in the chart as a wide spectrum with low peaks, which hints at high entropy. Now let's compare it to the YCoCg-R channels:



Nearly all the contrast is preserved in the brightness channel and if we compare the Y chart with the RGB chart it almost looks like a superposition of the RGB values. The CoCg channels are a lot less uniform, with high peaks and low valleys. The contrast that does remain in CoCg is located in sharp colour transitions like the yellow and blue feathers of the left parrot.

Another thing noticeable about the data in all channels is that in the general case the difference in colour between two adjacent pixels is low. By storing the value difference Δ between pixels in sequence, i.e. the velocity of change, we can most likely reduce the entropy of the data and improve compression rates. If we do that for the original RGB and the converted YCoCg-R data and plot it logarithmically we get these two charts:

The first thing we notice is that we now have very clear and defined peaks around zero, rapidly decreasing towards the edges. We can also clearly see that the average value distribution in ΔCo and ΔCg is significantly smaller than their RGB counterparts while the ΔY channel is about the same.
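As a minimal sketch of the delta idea (illustrative helpers, not YOGA's actual code), the forward transform and its inverse over one channel could look like this:

// Replace each value with its difference from the previous one.
void Delta_Encode(int16_t* channel, int32_t count) {
    int16_t previous = 0;
    for(int32_t i = 0; i < count; i++) {
        int16_t current = channel[i];
        channel[i] = current - previous; // the velocity of change
        previous = current;
    }
}

// Undo the transform by accumulating the differences again.
void Delta_Decode(int16_t* channel, int32_t count) {
    int16_t previous = 0;
    for(int32_t i = 0; i < count; i++) {
        previous += channel[i];
        channel[i] = previous;
    }
}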

Thanks to this analysis, we can see that ΔYCoCg-R is a good choice to compress this specific image. But for other images or even other data types like audio or joint animation transforms it might be more beneficial to represent the data as a second-order difference, a.k.a. the acceleration of change. Perhaps the data benefits more from another method like dictionary encoding or even keeping the data in its raw format. If we could quantify the compressibility of the data in different representations, we could choose from any number of implementations for any given input. That's where entropy analysis comes in!

Before we continue, we should define what entropy actually is. Entropy is a measure of how disordered a system is, and in our case it's a measure of the likelihood that we can successfully predict the next value in a fixed data set. For example, if all the values in our set are the same, there is no entropy because we can successfully guess the value for every position without fail. If all the values are completely random then the entropy is at its maximum, since every guess has the same chance of being correct as a dice roll.


Top row: from high entropy to medium entropy.
Bottom row: from medium entropy to low entropy.

Note that the entropy value of an image isn't related to the quality of it. It also doesn't affect the uncompressed size of the image in any way.

For a data set with known possible values (e.g. 0 to 255) we can calculate a normalised entropy value H. This is done by first calculating the information value I of an event E, defined as I(E) = -log(P(E)). For example, when tossing a fair coin each outcome has probability P(E) = 1/2. Because there are two possible values for the coin the logarithmic base is 2, and as we can see -log2(1/2) = 1. Similarly for a perfect six-sided die we get -log6(1/6) = 1, for a 256-sided die -log256(1/256) = 1, and so on: an event carries exactly 1 unit of information when all outcomes are equally likely and the base matches the number of outcomes. Now you see where we're going with this. Thanks to the "change of base" formula we can also rewrite log256(x) = log2(x)/8.

If we want to measure the given information G of a possible event En, we take the information value for that event multiplied by the probability of that event: G(En) = I(En) * P(En). To calculate the total entropy H we sum the given information over all possible events: H = Σn G(En). For a perfect coin we get H = 1/2 + 1/2 = 1; the coin is maximally entropic!

If we instead imagine an uneven coin with probabilities of 0.6 and 0.4 (the sum of all probabilities should still equal one) we instead find
H = -log2(0.6)*0.6 + (-log2(0.4)*0.4) ≈ 0.97
meaning the coin isn't completely entropic. Similarly, for a 99% heads, 1% tails coin we find an entropy value of 0.08, and as expected H approaches zero the less random the set of probabilities is.

Let's analyse it for a single value in a single channel. In the image there are 768 * 511 = 392448 pixels and the number of pixels with the value 123 in the red channel is 1768. Therefore the given information of finding a red pixel with that value can be found like this:

P(E) = 1768 / 392448 ≈ 0.0045
I(E) = -log2(0.0045) / 8 ≈ 0.974
G(E) = I(E) * P(E) ≈ 0.974 * 0.0045 ≈ 0.004

Repeat that for all values that red can take (0-255), sum them, and we find the entropy value H for the red channel to be 0.933!
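A small sketch of that whole per-channel calculation, assuming the occurrence array from earlier (the helper is illustrative, not YOGA's actual function):

#include <cmath>

// Normalised entropy H of one channel: sum G(E) = I(E) * P(E) over all
// values. The division by 8 is the change of base from log2 to log256.
double Channel_Entropy(const TFLI_YOGA_Occurrences* list,
                       int32_t list_count, int32_t pixel_count) {
    double h = 0.0;
    for(int32_t i = 0; i < list_count; i++) {
        if(list[i].occurrences == 0) continue; // log2(0) is undefined
        double p = (double)list[i].occurrences / (double)pixel_count;
        h += (-std::log2(p) / 8.0) * p;
    }
    return h;
}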

Let's now take a look at all of the representations of the image data we've talked about so far. In this chart we can see that the RGB channels all have a pretty high entropy, similar to Y, while CoCg are markedly lower with an average reduction in entropy of 10%. We also see that by converting the data to be the difference from the last pixel we get a drastic average reduction of nearly 42%, but ΔYCoCg-R still finishes well on top with a 55% average reduction. We could repeat this analysis for other colour models or data representations we'd like to support and pick the one that gives the lowest entropy value, but among these representations we can be confident that we've chosen the lowest one.


Runlength encoding

In any single channel we can infer that the pixels that are close to each other are statistically likely to be the same or nearly the same colour. Think of it like "repeat the previous value once again", or twice, or a thousand times. This is especially true after representing the colour as the difference in value from the previous. When we represent the data as copies or variations of previous data for a set length we call this "runlength encoding" or "RLE".

Here the concept of a "colour value" becomes blurred. For example, we can consider a repeat as a new colour value, 256. Repeating twice can be another new value, 257. Since the channel values no longer represent colours directly we will start calling them codes instead, and we call this method "fixed-code" (FC) RLE.

However, we immediately see the problem with giving every repetition length its own code: for large images there might be more unique RLE codes than there are colours! Instead we can use a single code as a marker for a repetition, followed by a count. This is called "fixed-length" (FL) RLE.

The issue with fixed-length is of course that you have to tack the count on afterwards, and we might end up with more bits than we saved. We again have to analyse our data, this time to optimise the number of codes we dedicate to FC and FL. We do that quite simply by going through each pixel value and comparing it to the last. We can store the result of this analysis in the channel data right away to speed up encoding later:

for(int32_t i = 0; i < params.channel_count; i++) {
    int32_t same_value_start = 0;
    int16_t same_value_track = -1;
    int16_t same_value_count = 0;
    int16_t* channel = params.channels[i].channel;
    for(int32_t j = 0; j < params.pixel_count; j++) {
        if(channel[j] == same_value_track) {
            if(!same_value_count) {
                same_value_start = j;
            }
            same_value_count++;
        } else {
            if(same_value_count) {
                // starting at 1 is implied
                same_value_count -= 1;
                channel[same_value_start] = _YOGA_FLAG_VALUE_RUNLENGTH | same_value_count;
                same_value_start = 0;
                same_value_count = 0;
            }
        }
        same_value_track = channel[j];
    }
}

Note: for the sake of visualisation the occurrence counts 1 and 0 have been replaced with 1.5 and 1 respectively, since log10(1) = 0 and log10(0) is undefined.

In the graph above we can see that most of the duplicate runs are very short. In fact, out of 70040 duplicate runs, 57% (40141) only run for a single pixel and 81% run once or twice. By comparison, there are only two instances of a run of 39 duplicates in the Co channel.

We also note that it takes a minimum of 6 bits to represent the value 39, plus the code for FL RLE. But we could also use the codes for 20 and 19, or even 10, 10 and 19, and choose the smallest combination of codes. It's not immediately clear what the optimal choice is here because we don't yet know how long the codes might be, but what is clear is that since the 39-duplicate run only happens twice, it hardly matters which encoding we pick for it.

Let's now imagine a different scenario and take this image of my dog. We run it through the same program and plot the alpha channel:

Compared to the previous example this data is very noisy and can take 441 possible values, so it's probably best to use FL RLE here. In practice however, it's usually better to use some kind of mix of both.

At this point we start to make some assumptions about our data, because it's too expensive to analyse every possible permutation to find the optimal compression settings. Because we have to store the meaning of every code in the file later, we want to limit the number of them. For that reason, let's say that if we find more than 31 distinct run lengths we add an FL RLE code; everything else becomes an FC RLE code. That means the 32nd code, the FL code, can start counting from the lowest value it needs to represent:

1:  FC01
2:  FC02
3:  FC03
(...)
29: FC29
30: FC30
31: FC31
32: FL + 0
33: FL + 1
34: FL + 2
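As a sketch of this mapping (a hypothetical helper, not YOGA's actual code), turning a run count into either an FC code or the FL marker plus an offset count could look like this:

#include <cstdint>

// Map a run count onto the 31-code FC/FL split described above.
struct RLE_Code {
    int32_t code;        // 1..31 for FC01..FC31, 32 for the FL marker
    int32_t count;       // only used for FL: the run length minus 32
    bool fixed_length;
};

RLE_Code Encode_Run(int32_t run) {
    RLE_Code result = {};
    if(run <= 31) {
        result.code = run;        // short runs get their own code
        result.fixed_length = false;
    } else {
        result.code = 32;         // FL marker followed by a count
        result.count = run - 32;  // the count starts at the lowest value it needs
        result.fixed_length = true;
    }
    return result;
}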

We can develop this further to squeeze out some extra bits when compressing, but it's highly dependent on the source data and most likely not worth it for images.


Huffman coding

After we've reduced the entropy in the data as much as possible it's time to pick a compression algorithm to encode it. For this purpose we will use Huffman coding to build a binary tree representation of the data using the occurrence counts. Such a tree is a type of multiply linked list where traversal is done by choosing one of two possible paths from the root. Each entry in the list is called a node, and each link can either be a branch to another node or the value represented by the code (also known as a "leaf").

In the image above we can see the possible node constructs to encode the letters A, B, C, D and E in a theoretical tree. In this example we could encode the letter A as a single bit 0b1. From there it follows that the codes for B, C, D and E are 0b110, 0b010, 0b100 and 0b000 respectively (read starting from the least significant bit, since decoding walks from the root). Note that for any given code that isn't A the first bit must be zero, since if it wasn't the decoder couldn't tell the difference between an A and a path through the tree. If you are unfamiliar with Huffman coding I can recommend this primer by Tom Scott.

A Huffman tree is constructed starting at the lowest-weighted end and working towards the root, weight in our case meaning the number of occurrences. In each step we find the option that gives the lowest total weight sum of the tree, until all the values are included. The process of creating the tree is therefore as follows:

1) Take the two lowest-occurring entries and make them a node.
2) Build the possible node that gives the lowest total weight, in order of priority:
   a) make a node from the two lowest entries
   b) make a node branching to two free-standing nodes
   c) make a node with one entry and one free-standing node
3) Repeat (2) until there are no remaining entries or free-standing nodes.

When we have full knowledge of the value set and probability distribution, we are guaranteed that each code is given the optimal bit-length representation for compression. To store these codes and code lengths we can add them to our previously made struct:

struct TFLI_YOGA_Occurrences {
    int32_t value;
    int32_t occurrences;
    int32_t huffman_code;
    int32_t huffman_length;
};

We will also write a TFLI_YOGA_Node struct that allows us to easily represent the data as a binary tree. Each node has two sides, as well as a combined_weight, which is the sum of the occurrences of the values in the node including all of its child nodes. The simplest approach to creating the codes later is to read the tree from each end back to the root, so we store the parent as well. Lastly, it's a good idea to include which index in the TFLI_YOGA_Occurrences array an end value refers to, in case we haven't sorted the list back to value order yet when writing the codes.

With all that said we write this:

enum TFLI_YOGA_NODE_TYPE {
    TFLI_YOGA_NODE_END,
    TFLI_YOGA_NODE_BRANCH,
    TFLI_YOGA_NODE_COUNT
};

struct TFLI_YOGA_Node {
    int32_t value_index[2];
    TFLI_YOGA_NODE_TYPE type[2];
    int32_t data[2];
    int32_t combined_weight;
    int32_t parent;
};

Assuming we have at least two values in the channel, we create the first node. As the list is sorted by occurrences, the two rarest entries are easy to find at the end of the list, and as we continue we'll step backwards until we reach the root index.

int32_t node_index = 0;
int32_t list_index = list_count - 1;
int32_t largest_list_occurrences = list[0].occurrences;
// grab the two lowest frequency entries
for(int32_t i = 0; i < 2; i++) {
    nodes[node_index].type[i] = TFLI_YOGA_NODE_END;
    nodes[node_index].data[i] = list[list_index].value;
    nodes[node_index].value_index[i] = list_index;
    nodes[node_index].combined_weight += list[list_index].occurrences;
    list_index--;
}
node_index++;
int32_t free_nodes = 1;

I'm calling this a NEW node, as we're creating a free-standing node with two ends. The two other types are called TIE, which ties two free-standing nodes together, and BUILD, which combines one entry with a free-standing node into an end/branch pair.

After we construct the first node we need to find whether a NEW, TIE or BUILD node yields the smallest occurrence sum for the tree. Note that depending on the number of free-standing nodes or list entries left, not every combination will be possible.

We construct the TIE and BUILD like this:

// tie two lowest nodes together
nodes[node_index].type[0] = TFLI_YOGA_NODE_BRANCH;
nodes[node_index].type[1] = TFLI_YOGA_NODE_BRANCH;
nodes[node_index].data[0] = lowest_node_index;
nodes[node_index].data[1] = second_lowest_node_index;
nodes[node_index].value_index[0] = lowest_node_index;
nodes[node_index].value_index[1] = second_lowest_node_index;
nodes[node_index].combined_weight = lowest_node_sum + second_lowest_node_sum;
nodes[lowest_node_index].parent = node_index;
nodes[second_lowest_node_index].parent = node_index;
free_nodes--;
// build on existing node
nodes[node_index].type[0] = TFLI_YOGA_NODE_END;
nodes[node_index].data[0] = list[list_index].value;
nodes[node_index].value_index[0] = list_index;
nodes[node_index].type[1] = TFLI_YOGA_NODE_BRANCH;
nodes[node_index].data[1] = lowest_node_index;
nodes[node_index].value_index[1] = lowest_node_index;
nodes[lowest_node_index].parent = node_index;
nodes[node_index].combined_weight = list[list_index].occurrences + lowest_node_sum;
list_index--;

This process is repeated until we run out of value entries and free-standing nodes. The last consideration to mention is that when sums are equal, the priority order is NEW, then TIE, then BUILD.

enum CHOICE { INVALID, NEW, TIE, BUILD };
CHOICE choice = INVALID;
if(pick_new) {
    choice = NEW;
} else if(pick_tie) {
    choice = TIE;
} else if(pick_build) {
    choice = BUILD;
} else if(possible_new) {
    choice = NEW;
} else if(possible_tie) {
    choice = TIE;
} else if(possible_build) {
    choice = BUILD;
}

Once the entire tree is constructed we generate the codes with a simple loop, finding each end and stepping backwards until we reach the root:

for(int32_t i = 0; i < (node_root + 1); i++) {
    for(int32_t j = 0; j < 2; j++) {
        if(nodes[i].type[j] == TFLI_YOGA_NODE_END) {
            int32_t current = i;
            int32_t huffman_code = j;
            int32_t huffman_length = 1;
            while(current < node_root) {
                int32_t parent = nodes[current].parent;
                bool result[2] = {};
                if(nodes[parent].type[0] == TFLI_YOGA_NODE_BRANCH)
                    result[0] = nodes[parent].data[0] == current;
                if(nodes[parent].type[1] == TFLI_YOGA_NODE_BRANCH)
                    result[1] = nodes[parent].data[1] == current;
                TFLR_ASSERT((result[0] != result[1]) || (current == 0));
                int32_t value;
                if(result[0]) value = 0;
                else          value = 1;
                huffman_code <<= 1;
                huffman_code |= value;
                huffman_length++;
                TFLR_ASSERT(huffman_length <= 32);
                if(huffman_length > 32) {
                    return YOGA_RETURN_ERROR_CODE;
                }
                current = parent;
            }
            list[nodes[i].value_index[j]].huffman_code = huffman_code;
            list[nodes[i].value_index[j]].huffman_length = huffman_length;
        }
    }
}

Implementing a look-up table

There are several methods to decode a bit-stream. The simplest is to traverse the tree bit by bit, and it's not a bad option as it only requires as much memory as the tree itself and performs reasonably well, but we can do better.

The fastest method is using a lookup table (LUT), but there are two significant problems with it. First, we need to read or generate this table and store it somewhere. Second, we need to ensure we don't stall the CPU by reading from slow memory. The fastest memory accessible by the CPU is the L1 cache; its size varies, but by far the most common size in modern CPUs is at least 32 KiB.

We don't want to store a large LUT in the file itself as it could take up a large part of the file size. At the same time, generating it might take more time than it's worth. Reading from a large LUT could also involve so many cache misses that the performance benefit of using it is almost, if not entirely, void. The solution therefore is to limit the size and use bit-traversal as a fallback for long codes.

A LUT indexed by n bits has 2^n entries, so table sizes only come in powers of two. If we keep within the assumed minimum L1 cache size while leaving some headroom, we are left with subdividing either 16 KiB for the 1, 2 and 4 channel cases or 24 KiB for the 3 channel case. Since we're at liberty to choose the table size we can dynamically scale it down for small images, and for larger images it will represent a very small part of the total size (less than 1% of a 2.5 MiB file for the full 24 KiB case).
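A sketch of how the table size could be picked under these constraints (my own heuristic for illustration; YOGA's exact logic may differ):

// Pick the largest per-channel table bit-length that fits the cache
// budget, scaled down when the image itself is small.
int32_t Pick_Table_Bitlength(int32_t channel_count, int32_t pixel_count) {
    // 16 KiB of int16_t entries for 1, 2 or 4 channels, 24 KiB for 3
    int32_t budget_bytes = (channel_count == 3) ? 24 * 1024 : 16 * 1024;
    int32_t entries = budget_bytes / (int32_t)sizeof(int16_t) / channel_count;
    int32_t bitlength = 0;
    while((1 << (bitlength + 1)) <= entries) bitlength++;
    // a table with far more entries than pixels is wasted space
    while(bitlength > 1 && (1 << bitlength) > pixel_count) bitlength--;
    return bitlength;
}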

Using this information we can optimise the use of the table. The largest table case is 16 KiB for a single channel, meaning the longest code we would need to assign to a table entry is 14 bits, which is representable by 4 bits in the lookup value. We also know each value is represented by a maximum of 9 bits for the Co and Cg channels, totalling 13 bits. Depending on the type of data we're encoding we might want a 32-bit integer for our LUT to make room for runlength counts and node indices, but as we will see 16-bit integers are enough here, leaving 3 bits for other purposes.

Firstly, we only have data encoded as either YCoCg-R values or runlengths, so we assign one bit for this purpose. Secondly, we have exactly two processes for reading the data, the LUT and the nodes, so the second bit is assigned for that. The remaining bit we don't have a specific use for at this time, so we add it to the value bits for a total of 10, meaning we can store runlength counts of up to 1024 repetitions.

Note however that when we are reading from the nodes we can use the full remaining 15 bits for the offset. We know the bit-length of the table, and we know we need to read at least one more bit to pick the side of the node to read from, so we don't need to know whether the value is a direct value or a runlength until we reach the end. Given all that, this is how the values in the LUT will be formatted:

0b0ABB'BBCC'CCCC'CCCC
0b1DDD'DDDD'DDDD'DDDD

A - flag ? runlength : colour
B - code length
C - value
D - node offset
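Packing entries in this format could look like this (hypothetical helpers for illustration, not YOGA's actual code):

// Pack a direct value: flag in bit 14, code length in bits 13-10,
// value in bits 9-0.
int16_t Pack_Direct(bool runlength, int32_t code_length, int32_t value) {
    return (int16_t)(((runlength ? 1 : 0) << 14) | (code_length << 10) | value);
}

// Pack a node reference: top bit set, offset in the remaining 15 bits.
int16_t Pack_Node(int32_t node_offset) {
    return (int16_t)(0x8000 | node_offset);
}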

Most of the node data we needed to create our codes is useless when decoding, so this struct can also be simplified. We only need to know whether each side of a node is a branch or an end value, and what that value is. The value can again be either colour data or a runlength, so the node struct and format are made like this:

struct TFLI_YOGA_File_Node {
    int16_t data[2];
};
0b0A00'00BB'BBBB'BBBB
0b1CCC'CCCC'CCCC'CCCC

A - flag ? runlength : colour
B - value
C - node offset

To give a practical example of how this process works, let's imagine we already have the YOGA file ready for decoding. We would grab the int16_t* lookup_table for the channel and try to decode a bit stream:

int32_t masked_bits = bits & table_mask;
int16_t table_value = lookup_table[masked_bits];

We find that table_value = -32704 = 0b1000'0000'0100'0000, so this indicates that we should jump to the node at offset 64:

TFLI_YOGA_File_Node* node = &nodes[table_value & 0x7FFF];
bits >>= table_bitlength;
bits_left -= table_bitlength;

Now all that's left to do is repeat until we find the final value:

for(;;) {
    bool bit = bits & 0b1;
    bits >>= 1;
    bits_left -= 1;
    if(node->data[bit] >= 0) {
        flag_runlength = node->data[bit] & (0b1 << 14);
        value = node->data[bit] & 0x3FFF;
        break;
    } else {
        node = &nodes[node->data[bit] & 0x7FFF];
    }
}

If the value had instead been read directly from the table, we would decode it like this:

flag_runlength = table_value & (0b1 << 14);
int32_t length = (table_value & 0x3C00) >> 10;
table_value &= 0x3FF;
bits >>= length;
bits_left -= length;

Writing header data

We've finally gone over everything needed to write the file so all that's left is doing the work. The YOGA header is defined like this:

enum YOGA_TYPE {
    YOGA_GRAY  = 0,
    YOGA_GRAYA = 1,
    YOGA_RGB   = 2,
    YOGA_RGBA  = 3
};

enum YOGA_COLOUR {
    YOGA_COLOUR_UNKNOWN = 0,
    YOGA_COLOUR_LINEAR  = 1,
    YOGA_COLOUR_SRGB    = 2
};

struct YOGA {
    char magic[4];
    uint16_t version_major;
    uint16_t version_minor;
    int32_t file_size;
    int32_t encoded_byte_size;
    int32_t width;
    int32_t height;
    int8_t channels;
    int8_t type;
    int8_t colour_space;
    int8_t table_bitlength;
    int16_t table_mask;
    uint16_t node_count;
    int16_t* lookup_table[TFLI_YOGA_MAX_CHANNELS];
    uint32_t* encoded_data;
    TFLI_YOGA_File_Node* nodes;
    uint8_t data[8]; // make struct 8-byte aligned
};

Most of these are self-explanatory and we won't go into them further, but you might wonder why we only have a single TFLI_YOGA_File_Node* pointer instead of one per channel. The reason is that we will only be storing the parts of the trees that don't fit into the LUT, basically pruning each tree into multiple smaller ones for which we store the offsets.

To start writing the LUT and nodes we need to know which list entries have codes that exceed the LUT in length. With our limited value range, a process as simple as this is enough:

int32_t within[1024];
int32_t exceeded[1024];
int32_t within_count = 0;
int32_t exceeded_count = 0;
for(int16_t i = 0; i < count_value; i++) {
    if(occurance_list[i].huffman_length <= table_bit_length) {
        // write values into table
        within[within_count++] = i;
    } else {
        // write values as nodes
        exceeded[exceeded_count++] = i;
    }
}

Since every code is unique we can safely write to the table without doing any collision checks. Codes that fit into the table can be shorter than the table width, so for these we need to write duplicate entries. For example, say that we have a 4-bit code 0b1001 and we're writing it into a 6-bit wide table. When reading the bit-stream we always read 6 bits for the lookup, so the 4-bit code needs to give the same result for every variation of the extra 2 bits, i.e. 0b00'1001, 0b01'1001, 0b10'1001 and 0b11'1001. The code for this could look something like this:

// use an unsigned constant so the shift doesn't sign-extend
int32_t variations = (int32_t)(~0u >> (32 - (table_width - code_length)));
for(int32_t j = 0; j <= variations; j++) {
    int32_t table_index = entry->huffman_code | (j << code_length);
    table[table_index] = value;
}

In the case that the length and width are the same we simply write the value into its position instead. That leaves the codes that are too long for the table, and since we already have the list of them we can find which entry points we need to write to the table. With the number of exceeded code lengths we're dealing with, we don't need anything more complicated than filling in values with a simple loop:

int32_t set_codes = 0;
for(int32_t i = 0; i < exceeded_count; i++) {
    int32_t code = table_mask & occurance_list[exceeded[i]].huffman_code;
    bool add_code = true;
    for(int32_t j = 0; j < set_codes; j++) {
        if(code == unique_codes[j]) {
            add_code = false;
            break;
        }
    }
    if(add_code) {
        unique_codes[set_codes++] = code;
    }
}

If we didn't already incorporate the runlength list into the colour value list we can do that here. The runlength flag value is part of the colour value list and has a code whose length serves as the base of the runlength codes. We can combine these and do the same as with the colour values, but in the rare case that the runlength flag code is too long for the table we need to find it among the written codes and write the entire runlength tree from that point:

runlength_node->data[entry_side] = root_index | (int16_t)0x8000; // make branch
TFLI_YOGA_File_Node* write_node = &yoga->nodes[root_index];
for(int32_t i = 0; i < 2; i++) {
    write_node->data[i] = int16_t((root_node->type[i] << 15) | root_node->data[i]);
    if(write_node->data[i] < 0) {
        TFLI_YOGA_Write_All_Children(
            yoga->nodes,
            params->nodes_runlength[channel],
            root_index,
            i,
            &params->written_nodes,
            0x4000
        );
    } else {
        write_node->data[i] |= 0x4000; // add runlength flag
    }
}

Writing encoded data

We've already written flags into the channel data to make encoding straightforward, so we start by making a buffer to hold the codes we want to encode. Ideally we want to produce the full pixel's data at once to avoid writing to the same memory multiple times, so the codes for each channel will follow each other. Additionally, we are assuming that no code for any value is longer than 32 bits.

for(int32_t i = 0; i < yoga_params->pixel_count; i++) {
    int32_t bytes_to_write = 0;
    int32_t bits_to_write = 0;
    uint64_t buffer[_YOGA_MAX_CHANNELS] = {};
    for(int32_t j = 0; j < yoga_params->channel_count; j++) {
        int16_t value = yoga_params->channels[j].channel[i];
        // ...

If we have a runlength active we can simply skip writing and reading the colour value, as we'll keep track of it when decoding. Otherwise we check for flags and write the correct code into the buffer:

if(flag_skip_count[j]) {
    flag_skip_count[j]--;
} else {
    TFLI_YOGA_Occurrences* list = yoga_params->occurrence_value[j];
    int16_t value = yoga_params->channels[j].channel[i];
    uint64_t code = 0;
    int32_t length = 0;
    // check if value is a flag
    if(value < 0) {
        // determine flag type by reading top 4 bits
        TFLI_YOGA_FLAG_VALUE flag = TFLI_YOGA_FLAG_VALUE(value & 0xF000);
        if(flag == TFLI_YOGA_FLAG_VALUE_RUNLENGTH) {
            // get count
            int16_t skip_count = value & 0x0FFF;
            // ...

Once we've found the code we can add it to the buffer without any gaps, using some pointer and bit arithmetic:

uint64_t* write_to_buffer = (uint64_t*)((uint8_t*)encoded_values + bytes_to_write);
*write_to_buffer |= code << bits_to_write;
bits_to_write += length;
while(bits_to_write > 7) {
    bits_to_write -= 8;
    bytes_to_write++;
}

After the codes for each channel have been put into the buffer we can finally write them to the file. The simplest option is to just write single bytes in order, shifting them if necessary and then filling in the remaining bits if there are any:

uint32_t* write_from = (uint32_t*)&encoded_values[0];
while(bytes_to_write) {
    uint32_t value_to_write = *((uint32_t*)write_from) & 0xFF;
    *write_to |= value_to_write << write_bit_offset;
    write_to = (uint32_t*)((uint8_t*)write_to + 1);
    write_from = (uint32_t*)((uint8_t*)write_from + 1);
    bytes_to_write--;
}
if(bits_to_write) {
    // uint8_t mask = 0xFF >> (8 - bits_to_write);
    uint32_t value_to_write = *((uint32_t*)write_from); // & mask;
    *write_to |= value_to_write << write_bit_offset;
    write_bit_offset += bits_to_write;
    if(write_bit_offset > 7) {
        write_bit_offset -= 8;
        write_to = (uint32_t*)((uint8_t*)write_to + 1);
    }
}

Because we know the image size before decoding we don't need any signal to stop at the end, but we don't necessarily want to end the file at the last possible byte either. If we want to load bits into a buffer using a certain integer size, we can pad the file with some zeroes at the end:

file_size = encoded_image->file_size + 16;
file_size = (file_size + 0xF) & 0xFFFFFFF0;
encoded_image = (YOGA*)realloc(encoded_image, file_size);

Decoding the image

We've finally gotten to the point where we can decode the encoded image and see if everything is working correctly. It's a good idea to start trying to decode the image right after the first transformation to YCoCg-R, as it helps with finding bugs and errors. As each transformation is added (delta, runlength etc.) we add support for decoding it, and we can verify these steps with a suite of test images to find edge cases.

We're using a 64-bit system to decode the bit-stream, so we create and fill a 64-bit buffer:

uint32_t* read = params.encoded_data;
uint64_t read_value = *(read++);
uint64_t bits = read_value;
read_value = *(read++);
bits |= read_value << 32;
int32_t bit_cache = sizeof(uint64_t) * 8;

We will ensure we always have at least 32 bits of data in the buffer. This way we can read the bit stream without worrying about the number of bits left in a particular code, assuming the longest code is no more than 32 bits:

if(bits_left < 32) {
    read_value = *(read++);
    bits |= read_value << bits_left;
    bits_left += 32;
}

In the section Implementing a look-up table we covered the specifics of reading the bit-stream, and every other transformation is just applied in reverse. If we read a runlength value we just reuse the last value for that channel. To undo the delta transformation we keep track of the last value we decoded and add the new decoded value to it. Lastly we do the YCoCg-R to RGB conversion and we're done.
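Undoing the delta, for instance, is just an accumulator per channel (a sketch; the mask follows the channel's bit width, 8 bits for Y and A, 9 for Co and Cg):

// Accumulate a decoded delta back into an absolute channel value.
// `previous` persists across pixels for the given channel.
int16_t Undo_Delta(int16_t decoded_delta, int16_t* previous, int16_t mask) {
    *previous = (int16_t)((*previous + decoded_delta) & mask);
    return *previous;
}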


Analysing performance

We can compare the size and speed of our code to other file formats and libraries to get an objective metric of how well we've done. We will do this by comparing file size as well as encoding and decoding speed on the QOI test suite. Measuring the file size is easy enough, but we are at the mercy of the operating system when measuring the time it takes to encode and decode.

To start, we have to simply hope that the OS doesn't decide to suddenly interrupt our execution thread. Since we're not using multiple threads this shouldn't be a big issue and with enough data this factor will be minimised.

Secondly, we give each library the best possible chance by reading and writing to or from memory that is already loaded and available. For STB Image and QOI this requires overriding their memory-call macros, but for STB Image Write specifically there wasn't an obvious way to do this without rewriting the library. Perhaps this is why it performs so badly on web screenshots: those images are much larger than typical images, and it probably runs into a bunch of realloc calls that require memory to be copied.

Finally, we are also relying on the OS to provide us with an accurate timer. In the benchmark source you can see that both the time and the cycle count are tracked. The cycle counts are not presented here, but they can be used to double-check that the times are accurate, which they are.
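The wall-clock side of the measurement is straightforward; a minimal sketch (the real benchmark also reads the cycle counter alongside it):

#include <chrono>

// Time a piece of work in milliseconds using a monotonic clock.
template <typename F>
double Time_Ms(F&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}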

With all of that said, the tables show the data as measured, while the graphs are presented relative to YOGA. This is simply done to make the differences easier to see: for example, a time value of 5 means that the library performed 5x slower than YOGA, while values below 1 indicate it being faster. All libraries use their default settings.

Encode time (ms) libpng STB Image QOI YOGA
icon_64 143.64 138.62 6.09 201.27
icon_512 4,019 3,102 239.78 1,466
kodak 2,725 1,533 87.50 351.43
tecnick 50,838 22,173 1,299 4,796
wikipedia 17,544 8,385 535.74 1,894
pngimg 54,483 32,402 1,904 8,887
game 80,466 37,560 3,078 14,294
web 2,022 6,924 109.22 526.08
photo 5,149 3,009 164.37 637.50
pk 8,772 4,346 416.95 1,831
pk01 2,858 2,061 75.72 338.76
pk02 26,520 10,916 576.65 2,248
plants 8,431 6,663 263.57 1,361
Total 263,973 139,214 8,756 38,832
encoding time chart

Here we can see that YOGA is much faster than libpng and STBI in every circumstance except small images, on average almost 7 times faster than libpng. Without looking into their implementations, YOGA probably does more analysis of the data before compression starts and less analysis during compression, which creates this discrepancy. QOI is blowing all the other libraries out of the park though, being a factor of 5-100 times faster, almost as fast as just writing bitmap data directly.

Decode time (ms) libpng STB Image QOI YOGA
icon_64 40.08 14.97 4.46 16.48
icon_512 413.56 333.42 138.09 364.04
kodak 157.75 160.47 67.23 122.03
tecnick 2,030 2,514 1,043 1,768
wikipedia 832.48 978.88 396.77 727.40
pngimg 3,755 3,821 1,217 2,688
game 4,166 4,302 2,055 4,445
web 630.98 475.31 63.68 151.70
photo 324.79 365.20 142.45 232.59
pk 526.31 367.74 334.44 566.38
pk01 240.81 231.15 55.78 100.17
pk02 1,144 1,213 460.52 763.05
plants 735.16 720.25 168.05 385.93
Total 14,997 15,498 6,147 12,331
decoding time chart

Decoding is much more uniform in its results between the libraries, as is expected since they all work on decoding a stream of bits using data from the header. Still, YOGA is about the same if not faster in nearly every case compared to libpng and STBI, being on average 20-25% faster, but QOI is still the clear winner in terms of speed.

File size (kB) libpng STB Image QOI YOGA
icon_64 807.67 948.05 978.26 1,369
icon_512 10,892 14,775 18,221 19,008
kodak 14,231 20,957 14,735 12,938
tecnick 211,404 305,452 252,777 209,426
wikipedia 85,144 122,474 99,661 81,851
pngimg 208,158 322,105 245,886 200,509
game 259,133 364,998 320,741 303,521
web 9,008 43,073 9,720 9,108
photo 36,724 46,697 39,638 27,635
pk 78,757 58,797 74,178 75,909
pk01 13,010 26,254 14,232 12,616
pk02 96,957 146,541 108,482 87,964
plants 37,322 71,470 39,914 34,810
Total (MB) 1,037 1,508 1,210 1,051
filesize chart

Perhaps the most important statistic when it comes to compression is the compression ratio. Size matters, and unfortunately STBI's web result doesn't even fit on the graph, towering above at 4.73 times larger than YOGA. YOGA and libpng produce very similar sizes, with YOGA being a fraction smaller in some practical cases like photos and textures. Overall libpng still wins, and if you raise its compression level to 9 instead of the default 6 it might be smaller in those cases too.

Final words

YOGA is definitely the most complex and difficult software I have tried to write to date. To start, I only spent about 10 hours learning theory in the same way most computer science students would learn it. But after that, I had to spend over 100 hours programming a working prototype before I could even export an entire file, because the compression required so many steps to get to the point where you can start decompressing it again. Then came the nightmare of debugging the almost 3000 lines of code, where every line can have a profound impact on the result. Unless you have a special interest, compression is not really a rabbit hole I would recommend, but it has taught me a lot, not only about compression but also about shipping code and designing an API. Creating this article about it has also been interesting, albeit a lot of work on its own.

There are things I would consider improving, especially the performance on small icons. YOGA also doesn't do any multicolour analysis at all, nor does it check whether YCoCg-R actually makes the image more compressible. This probably partly explains why it does so well on textures and photos but not icons, for example. Another thing I've been thinking about is subdividing the image into chunks to better optimise the compression of each chunk, but that would take a complete rewrite of the library and is not something I'm interested in developing at the moment.

So for the foreseeable future I am satisfied with its performance, and I will use it for my own projects moving forward with the satisfaction of knowing I made it completely on my own. If you have read all of this, thank you, and let me know if you have any feedback!
