Sunday, February 8, 2015

Reducing Space Usage in Redis for UUIDs

One of the primary concerns when setting up a data store is figuring out how much space it will need for your use case, how it will perform on datasets of the sizes you expect, and when you will need to consider advanced features like sharding to handle your load. With this in mind, I was thinking the other day about how to spec a system for storing large amounts of data in Redis, keyed for the most part by UUID. Most of the values associated with these keys were either UUIDs as well, or relatively tiny data types like booleans and integers. Given these aspects of the intended system, combined with the standard practice of storing data in Redis as UTF-8 strings, it was immediately apparent to me that there were some easy gains to be had.


How to turn 36 bytes into 16


Redis keys are binary-safe strings, so the majority of client libraries simply call toString() (or your language's equivalent) on your key and call it a day. It occurred to me that the majority of the space this data store occupied would be taken up by these string-encoded UUIDs, which have this canonical form:
de305d54-75b4-431b-adb2-eb6b9e546014
As a UTF-8 string, this form takes up one byte per character, or 36 bytes: 32 informational hex characters plus 4 dashes. What a waste! A UUID is simply a 16-octet (128-bit) integer, with 2^128 possible values:
340282366920938463463374607431768211456
As a UTF-8 string, this number would be 39 bytes. Not only is that worse than 36 bytes, it's close to the average-case integer representation too (even 2^128 / 2 still has 39 digits). But! Who says we have to represent this as a human-readable string of bytes? If we represent the UUID as a raw byte array, it is of course only 16 bytes:
\xde\x30\x5d\x54\x75\xb4\x43\x1b\xad\xb2\xeb\x6b\x9e\x54\x60\x14
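In Python, a minimal sketch of that round trip looks like this (the standard library's uuid module does the packing for us; the helper names are my own):

```python
import uuid

CANONICAL = "de305d54-75b4-431b-adb2-eb6b9e546014"

def uuid_to_key(uuid_str: str) -> bytes:
    """Pack the 36-character canonical form into its 16 raw bytes."""
    return uuid.UUID(uuid_str).bytes

def key_to_uuid(key: bytes) -> str:
    """Recover the canonical form from the 16 raw bytes."""
    return str(uuid.UUID(bytes=key))

key = uuid_to_key(CANONICAL)
print(len(CANONICAL.encode("utf-8")))  # 36 bytes as a UTF-8 string
print(len(key))                        # 16 bytes as a raw byte array
print(key_to_uuid(key) == CANONICAL)   # True: the conversion is lossless
```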
Boom. We just reduced our data store size by something on the order of 50%. It won't be quite the full 1 - 16/36 ≈ 55% reduction, because there's overhead for every KV pair and there's potentially some other data stored (booleans, ints, etc.), but that's still a huge win. Now my single instance of Redis can last me twice as long (assuming a linear growth curve) before I need to worry about sharding, etc. Normally I wouldn't recommend relying on anything less than an order-of-magnitude gain for an architectural decision, but this is really just a thought experiment :)

Are you crazy?


Obviously this does have some drawbacks. If I'm debugging a production issue and the logs say there's a problem with ID xxx-xx-xx, I can't just fire up redis-cli and get the value of $keyPrefix-xxx-xx-xx. I'll have to use some custom tooling to convert my logged ID to a byte array, add my prefix, and run the command I want against that. This is a non-trivial cost, but for a use case in which the primary data stored is UUIDs we're talking about keys shrinking from 36 bytes to 16, a ~55% reduction in space. If I can support a user base and dataset that is more than twice as large on the same hardware, I consider that a win, especially since the primary limiting factor in most projects is IO bottlenecking. That said, I probably wouldn't do this in the real world, because the value of being able to debug quickly and get other developers up to speed usually exceeds the value of performance gains of less than 10x.
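That custom tooling could be as small as a helper like the one below (a sketch; the function name and the string prefix handling are my own invention, and you'd feed the result to any client library that accepts binary keys):

```python
import uuid

def logged_id_to_key(key_prefix: str, logged_id: str) -> bytes:
    """Turn a human-readable UUID from the logs back into the stored binary key."""
    return key_prefix.encode("utf-8") + b"-" + uuid.UUID(logged_id).bytes

# Paste the ID from the logs, get back the exact key bytes to query with.
key = logged_id_to_key("standardUser", "de305d54-75b4-431b-adb2-eb6b9e546014")
print(len(key))  # 12 prefix bytes + 1 dash + 16 UUID bytes = 29
```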

Let's go further


Alright so we've already established that I'm crazy. How far can we take it? There's one other obvious target for reducing wasted space in Redis KV pairs - the key prefix. Normally we namespace keys to keep them from colliding, so we might have some keys that look like (assuming non-binary UUIDs):
     standardUser-de305d54-75b4-431b-adb2-eb6b9e546014
groupNotification-de305d54-75b4-431b-adb2-eb6b9e546015
   messageContent-de305d54-75b4-431b-adb2-eb6b9e546016
For these keys we're using about 16 bytes each purely for namespacing. With 16 bytes we could represent 2^128 namespaces! (Conveniently the same size as a UUID :) How many namespaces do we really need for keys? 256? 65,536? Let's go with that. 2 bytes as a byte array gives us our 65k prefixes. We store these in an enum somewhere in the common lib for our project and bingo, we have a way to reference our namespaces 87.5% more space-efficiently (2 bytes instead of 16):
\x00\x00-de305d54-75b4-431b-adb2-eb6b9e546015
\x00\x01-de305d54-75b4-431b-adb2-eb6b9e546015
\x00\x02-de305d54-75b4-431b-adb2-eb6b9e546015
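A sketch of that enum in Python (the namespace names and numeric values here are made up; struct.pack(">H") encodes each one as exactly 2 big-endian bytes):

```python
import struct
from enum import IntEnum

class Namespace(IntEnum):
    # Hypothetical namespaces; an unsigned short covers 65,536 of them.
    STANDARD_USER      = 0
    GROUP_NOTIFICATION = 1
    MESSAGE_CONTENT    = 2

def prefix_bytes(ns: Namespace) -> bytes:
    """Encode the namespace as exactly 2 bytes (big-endian unsigned short)."""
    return struct.pack(">H", ns)

key = prefix_bytes(Namespace.GROUP_NOTIFICATION) + b"-de305d54-75b4-431b-adb2-eb6b9e546015"
print(len(key))  # 2 prefix bytes + 1 dash + 36 UUID characters = 39
```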
When we combine the two approaches together, we turn an average key length of 16 + 1 + 36 = 53 bytes into 2 + 1 + 16 = 19 bytes, for an average savings of 64%. Awesome:
\x00\x00-\xde\x30\x5d\x54\x75\xb4\x43\x1b\xad\xb2\xeb\x6b\x9e\x54\x60\x15
\x00\x01-\xde\x30\x5d\x54\x75\xb4\x43\x1b\xad\xb2\xeb\x6b\x9e\x54\x60\x15
\x00\x02-\xde\x30\x5d\x54\x75\xb4\x43\x1b\xad\xb2\xeb\x6b\x9e\x54\x60\x15
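A self-contained sketch of the full key builder (again with a made-up Namespace value; the arithmetic in the docstring matches the 19-byte total above):

```python
import struct
import uuid
from enum import IntEnum

class Namespace(IntEnum):
    # Hypothetical namespace value stored as a 2-byte prefix.
    STANDARD_USER = 0

def compact_key(ns: Namespace, uuid_str: str) -> bytes:
    """Build the 19-byte key: 2 namespace bytes + 1 dash + 16 raw UUID bytes."""
    return struct.pack(">H", ns) + b"-" + uuid.UUID(uuid_str).bytes

key = compact_key(Namespace.STANDARD_USER, "de305d54-75b4-431b-adb2-eb6b9e546015")
print(len(key))  # 19 bytes, down from 53
```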

Yea, this is crazy


If you go take a gander at the Redis intro to data types, it mentions the following about best practices for keys:
"Very short keys are often not a good idea. There is little point in writing "u1000flw" as a key if you can instead write "user:1000:followers". The latter is more readable and the added space is minor compared to the space used by the key object itself and the value object. While short keys will obviously consume a bit less memory, your job is to find the right balance."
Bah humbug. As much as I hate to admit it, this is right: the grist just isn't worth the grind. You win this time, antirez!

That said - in my next post I'm still going to implement an extension to Scredis to do this anyway, just for kicks.