Redis Hashsets - Performance Analysis & Numerical Key Compression

Recently I read a pretty interesting blog post by Mike Krieger, co-founder of Instagram. In short, the team faced the challenge of storing 300 million key-value pairs that resolve each photo to the user who owns it. They even wrote a test in Python and posted it as a GitHub gist.

I wanted to see if I could improve a bit on the memory used by the Redis hash, so I set out to write an algorithm that might further reduce the overhead.

I was using groovyconsole to run all of these scripts.

Here is a comparison of using a traditional numerical id to one of my compressed ids:

def r = new Random()
def num = r.nextInt(12000000).toString()
println num
println num.bytes.length
println encodeNumber(num, getEncoder())
println encodeNumber(num, getEncoder()).bytes.length

// 100 symbols; the list index (0-99) is the two-digit value each symbol stands for
def getEncoder(){
    ['1','2','3','4','5','6','7','8','9','0','-','=','!','@','#','$','%','^','&','*','(',')','_',
     '+','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z',
     'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','[',']',
     '\\',';','\'',',','.','/','{','}','|',':','\"','<','>','?','`','~',' ','ä','Ä','Ü','Ö','ü']
}

String encodeNumber(number, encoder){
    String encoded = ""
    String numberString = number.toString()
    // pad odd-length numbers with a leading zero so the digits split evenly into pairs
    numberString = ((numberString.length() % 2) != 0 ? "0$numberString" : numberString)
    Integer token = numberString.length()-1
    // walk from the last pair of digits to the first, emitting one symbol per pair
    while(token >= 0){
        encoded += encoder[numberString[token-1..token].toInteger()]
        token-=2
    }
    encoded
}
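To make the 2->1 reduction concrete, here is a small trace of what happens to a single id. This is only an illustrative sketch: splitIntoPairs is a hypothetical helper that is not part of the script above, and the alphabet is cut down to the first 12 symbols of the encoder list, which is enough for the sample id.

```groovy
// Hypothetical helper (not in the original script): returns the two-digit
// groups in the order the encoder visits them, last pair first.
List<String> splitIntoPairs(number) {
    String s = number.toString()
    if (s.length() % 2 != 0) s = "0" + s   // pad odd-length ids with a leading zero
    List<String> pairs = []
    for (int i = s.length() - 2; i >= 0; i -= 2) {
        pairs << s.substring(i, i + 2)
    }
    pairs
}

// First 12 symbols of the encoder list; index 0 is '1', index 9 is '0'
def alphabet = ['1','2','3','4','5','6','7','8','9','0','-','=']

def pairs = splitIntoPairs(1100502)
println pairs                                     // prints [02, 05, 10, 01]

// Each pair is used as an index into the alphabet
def encoded = pairs.collect { alphabet[it.toInteger()] }.join()
println encoded                                   // prints 36-2
```

The seven-digit id ends up as four characters, and the pairs come out in reverse order, which is why the decoders also walk the encoded string from the end.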

The list lookup and key transformation add some time compared with using plain numeric keys. For the reverse transformation I tried both a list and a map, and the map is clearly much faster for decoding. The Redis operations in the later script do not reflect this speedup, however, because that script only encodes values. Below is the code I used to test the speed differences in encoding and decoding with lists and maps.

The results I saw were consistent with the following:

5000 encode map took 0.0286 ms each on avg and 143 ms total
5000 decode map took 0.0496 ms each on avg and 248 ms total
5000 encode list took 0.0186 ms each on avg and 93 ms total
5000 decode list took 1.5004 ms each on avg and 7502 ms total
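The gap between the two decode timings most likely comes down to how a symbol is turned back into its index. The sketch below contrasts the two lookups on an abbreviated ten-symbol alphabet; the names alphabet and reverseIndex are made up for this illustration and are not taken from the benchmark script.

```groovy
// Abbreviated alphabet for illustration only; the real encoder has 100 symbols
def alphabet = ['1','2','3','4','5','6','7','8','9','0']

// Build the reverse index once: symbol -> position in the list
Map<String, Integer> reverseIndex = [:]
alphabet.eachWithIndex { symbol, i -> reverseIndex[symbol] = i }

// List-based lookup: scans the elements until it finds a match
int viaList = alphabet.indexOf('0')

// Map-based lookup: a single hash lookup, independent of alphabet size
int viaMap = reverseIndex['0']

assert viaList == viaMap
println "list: $viaList, map: $viaMap"   // prints "list: 9, map: 9"
```

Note that indexOf stops at the first match, whereas a lookup written with eachWithIndex invokes its closure for every element on every character, so it does more work than even this linear scan.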

def tests = 5000

def enm = {n, e-> encodeNumberMap(n, e).toString() }
def dnm = {n, e-> decodeNumberMap(n, e).toString() }
performTest("map", tests, getEncoderMap(), getDecoderMap(), enm, dnm)

def en = {n, e-> encodeNumber(n, e).toString() }
def dn = {n, e-> decodeNumber(n, e).toString() }
performTest("list", tests, getEncoder(), getEncoder(), en, dn)

def performTest(name, tests, encoder, decoder, encd, decd){
    def decodeTime = [], encodeTime = []
    def encodedNumber, decodedNumber, number
    def random = new Random()
    tests.times {
        number = Math.abs(random.nextLong())
        encodeTime << benchmark { encodedNumber = encd.call(number, encoder) }
        decodeTime << benchmark { decodedNumber = decd.call(encodedNumber, decoder) }
        //println "$number == $decodedNumber using coded $encodedNumber"
        assert number.toString() == decodedNumber
    }
    println "$tests encode $name took ${(encodeTime.sum() / encodeTime.size())} ms each on avg and ${encodeTime.sum()} ms total"
    println "$tests decode $name took ${(decodeTime.sum() / decodeTime.size())} ms each on avg and ${decodeTime.sum()} ms total"
    println ""
}

def getEncoder(){
    ['1','2','3','4','5','6','7','8','9','0','-','=','!','@','#','$','%','^','&','*','(',')','_',
     '+','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z',
     'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','[',']',
     '\\',';','\'',',','.','/','{','}','|',':','\"','<','>','?','`','~',' ','ä','Ä','Ü','Ö','ü']
}

def getDecoderMap(){
    ['1':0,'2':1,'3':2,'4':3,'5':4,'6':5,'7':6,'8':7,'9':8,'0':9,'-':10,'=':11,'!':12,'@':13,'#':14,'$':15,'%':16,'^':17,'&':18,'*':19,
     '(':20,')':21,'_':22,'+':23,'a':24,'b':25,'c':26,'d':27,'e':28,'f':29,'g':30,'h':31,'i':32,'j':33,'k':34,'l':35,'m':36,'n':37,'o':38,'p':39,'q':40,
     'r':41,'s':42,'t':43,'u':44,'v':45,'w':46,'x':47,'y':48,'z':49,'A':50,'B':51,'C':52,'D':53,'E':54,'F':55,'G':56,'H':57,'I':58,'J':59,'K':60,'L':61,
     'M':62,'N':63,'O':64,'P':65,'Q':66,'R':67,'S':68,'T':69,'U':70,'V':71,'W':72,'X':73,'Y':74,'Z':75,'[':76,']':77,
     '\\':78,';':79,'\'':80,',':81,'.':82,'/':83,'{':84,'}':85,'|':86,':':87,'\"':88,'<':89,'>':90,'?':91,'`':92,'~':93,' ':94,'ä':95,'Ä':96,'Ü':97,'Ö':98,'ü':99]
}

def getEncoderMap(){
    //reverse map for encoder, probably would create the map manually for less execute time
    Map encoderMap = [:]
    getDecoderMap().each{
        encoderMap.put((it.value.toString().length() == 1 ? "0${it.value}".toString() : it.value.toString()), it.key)
    }
    encoderMap
}

def decodeNumber(number, encoder){
    String decoded = ""
    Integer num = 0
    String numberString = number.toString()
    Integer token = numberString.length()-1
    while(token >= 0){
        // the return only ends the closure call for the current element,
        // so every symbol lookup walks the whole list
        encoder.eachWithIndex{o,i->
            if(o == numberString[token]) num = i
            return
        }
        decoded = decoded + (num.toString().length() == 1 ? "0$num" : num)
        token-=1
    }
    // drop the padding zero that encoding adds to odd-length numbers
    (decoded.startsWith("0") ? decoded[1..decoded.length()-1] : decoded)
}

String encodeNumber(number, encoder){
    String encoded = ""
    String numberString = number.toString()
    numberString = ((numberString.length() % 2) != 0 ? "0$numberString" : numberString)
    Integer token = numberString.length()-1
    while(token >= 0){
        encoded += encoder[numberString[token-1..token].toInteger()]
        token-=2
    }
    encoded
}

String decodeNumberMap(number, map){
    String decoded = ""
    String numberString = number.toString()
    Integer num = 0
    Integer token = numberString.length()-1
    while(token >= 0){
        num = map[numberString[token]]
        decoded = decoded + (num.toString().length() == 1 ? "0$num" : num)
        token-=1
    }
    (decoded.startsWith("0") ? decoded[1..decoded.length()-1] : decoded)
}

String encodeNumberMap(number, map){
    String encoded = ""
    String numberString = number.toString()
    numberString = ((numberString.length() % 2) != 0 ? "0$numberString" : numberString)
    Integer token = numberString.length()-1
    while(token >= 0){
        encoded += map[numberString[token-1..token].toString()].toString()
        token-=2
    }
    encoded
}

Long benchmark(closure) {
    def start = System.currentTimeMillis()
    closure.call()
    def now = System.currentTimeMillis()
    now - start
}

I wrote a test similar to the one provided by Mike (https://gist.github.com/1329319), using compressed ids instead of the full numerical ids, and saved an additional 18% in space, going from 17 MB per 1,000,000 keys to 14 MB per 1,000,000 keys. I could probably squeeze a little more efficiency out of the Redis hash if I ran both the first and second ids through the compression methods. For the space savings to be really worthwhile, I am assuming the id would need to be at least 4 digits long, since a 3-digit id only shrinks by one byte, from 3 to 2 (33%), whereas a 4-digit id goes from 4 to 2 (50%). Five and six digit ids both reduce to 3 chars, seven and eight to 4, and so on. Because I am doing a 2->1 reduction and have to pad odd-length ids with a 0, an id of even length n and an id of length n-1 end up with the same number of bytes in memory (n/2).
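The length arithmetic above can be checked with a few lines of Groovy. This is a rough sketch only: encodedLength and savedPercent are hypothetical helpers, the percentage is rounded down, and the figures count characters. Characters equal bytes only for the ASCII symbols in the encoder list; the five umlaut symbols take two bytes each in UTF-8.

```groovy
// Hypothetical helper: characters left after the 2->1 reduction
int encodedLength(int digits) {
    (digits + 1).intdiv(2)   // odd lengths are padded up to the next even length
}

// Hypothetical helper: whole-number percentage saved, rounded down
int savedPercent(int digits) {
    (100 * (digits - encodedLength(digits))).intdiv(digits)
}

(3..8).each { digits ->
    println "$digits digits -> ${encodedLength(digits)} chars (${savedPercent(digits)}% smaller)"
}
// first two lines printed:
// 3 digits -> 2 chars (33% smaller)
// 4 digits -> 2 chars (50% smaller)
```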

Here is the complete script I wrote to test the id space reduction hypothesis. I would certainly welcome any feedback on additional improvements that can be made to the code. I think there are probably further reductions in memory to be had.

@Grapes([
    @Grab('redis.clients:jedis:1.5.1'),
    @GrabConfig(systemClassLoader=true)
])
import redis.clients.jedis.*

performTest("unencoded", null) {n, e-> n.toString() }
performTest("encoder map", getEncoderMap()) {n, e-> encodeNumberMap(n, e).toString() }
performTest("encoder list", getEncoder()) {n, e-> encodeNumber(n, e).toString() }

def performTest(name, encoder, closure){
    Random random = new Random()
    Integer NUM_ENTRIES = 1000000
    Integer MAX_VALUE = 12000000
    Integer COMMIT_FREQUENCY = NUM_ENTRIES.intdiv(10)
    Jedis jedis = new Jedis("localhost")
    jedis.flushAll()
    Pipeline p = jedis.pipelined()
    println "jedis used memory at start : ${getRedisProperty(jedis, 'used_memory')} bytes"
    println "testing $name size of putting $NUM_ENTRIES objects into redis hash"
    def duration = benchmark {
        NUM_ENTRIES.times {
            def number = random.nextInt(MAX_VALUE)
            // integer division keeps the bucket key free of a trailing ".0"
            def bucket = number.intdiv(500)
            p.hset(bucket.toString(), it.toString(), closure.call(number, encoder))
            // flush the pipeline every COMMIT_FREQUENCY entries
            if((it % COMMIT_FREQUENCY) == 0){
                p.execute()
                p = jedis.pipelined()
            }
        }
        p.execute()
    }
    println "redis used memory after $name hset operation : ${getRedisProperty(jedis, 'used_memory')} bytes"
    println "took $duration ms to complete $name"
    println ""
}

def getEncoder(){
    ['1','2','3','4','5','6','7','8','9','0','-','=','!','@','#','$','%','^','&','*','(',')','_',
     '+','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z',
     'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','[',']',
     '\\',';','\'',',','.','/','{','}','|',':','\"','<','>','?','`','~',' ','ä','Ä','Ü','Ö','ü']
}

def getDecoderMap(){
    ['1':0,'2':1,'3':2,'4':3,'5':4,'6':5,'7':6,'8':7,'9':8,'0':9,'-':10,'=':11,'!':12,'@':13,'#':14,'$':15,'%':16,'^':17,'&':18,'*':19,
     '(':20,')':21,'_':22,'+':23,'a':24,'b':25,'c':26,'d':27,'e':28,'f':29,'g':30,'h':31,'i':32,'j':33,'k':34,'l':35,'m':36,'n':37,'o':38,'p':39,'q':40,
     'r':41,'s':42,'t':43,'u':44,'v':45,'w':46,'x':47,'y':48,'z':49,'A':50,'B':51,'C':52,'D':53,'E':54,'F':55,'G':56,'H':57,'I':58,'J':59,'K':60,'L':61,
     'M':62,'N':63,'O':64,'P':65,'Q':66,'R':67,'S':68,'T':69,'U':70,'V':71,'W':72,'X':73,'Y':74,'Z':75,'[':76,']':77,
     '\\':78,';':79,'\'':80,',':81,'.':82,'/':83,'{':84,'}':85,'|':86,':':87,'\"':88,'<':89,'>':90,'?':91,'`':92,'~':93,' ':94,'ä':95,'Ä':96,'Ü':97,'Ö':98,'ü':99]
}

def getEncoderMap(){
    //reverse map for encoder, probably would create the map manually for less execute time
    Map encoderMap = [:]
    getDecoderMap().each{
        encoderMap.put((it.value.toString().length() == 1 ? "0${it.value}".toString() : it.value.toString()), it.key)
    }
    encoderMap
}

def decodeNumber(number, encoder){
    String decoded = ""
    Integer num = 0
    String numberString = number.toString()
    Integer token = numberString.length()-1
    while(token >= 0){
        encoder.eachWithIndex{o,i->
            if(o == numberString[token]) num = i
            return
        }
        decoded = decoded + (num.toString().length() == 1 ? "0$num" : num)
        token-=1
    }
    (decoded.startsWith("0") ? decoded[1..decoded.length()-1] : decoded)
}

String encodeNumber(number, encoder){
    String encoded = ""
    String numberString = number.toString()
    numberString = ((numberString.length() % 2) != 0 ? "0$numberString" : numberString)
    Integer token = numberString.length()-1
    while(token >= 0){
        encoded += encoder[numberString[token-1..token].toInteger()]
        token-=2
    }
    encoded
}

String decodeNumberMap(number, map){
    String decoded = ""
    String numberString = number.toString()
    Integer num = 0
    Integer token = numberString.length()-1
    while(token >= 0){
        num = map[numberString[token]]
        decoded = decoded + (num.toString().length() == 1 ? "0$num" : num)
        token-=1
    }
    (decoded.startsWith("0") ? decoded[1..decoded.length()-1] : decoded)
}

String encodeNumberMap(number, map){
    String encoded = ""
    String numberString = number.toString()
    numberString = ((numberString.length() % 2) != 0 ? "0$numberString" : numberString)
    Integer token = numberString.length()-1
    while(token >= 0){
        encoded += map[numberString[token-1..token].toString()].toString()
        token-=2
    }
    encoded
}

String getRedisProperty(jedis, prop){
    Map map = [:]
    jedis.info().split('\n').each {param ->
        def p = param.trim().split(':', 2)
        // skip blank lines and any line that is not a key:value pair
        if(p.length == 2) map[p[0]] = p[1]
    }
    map[prop]
}

Long benchmark(closure) {
    def start = System.currentTimeMillis()
    closure.call()
    def now = System.currentTimeMillis()
    now - start
}

Christian Oestreich
