PySmaz: a Pure Python Small-String Compression Library Ported from SMAZ

PySmaz is a pure Python port of the SMAZ small-string compression library originally developed by Salvatore Sanfilippo. I thought I'd jot down a few notes about the experience of porting the library from C to Python, in particular the challenges of improving the algorithm's absolute compression ratio and keeping throughput acceptable for its intended use, while preventing the code from descending into an unreadable mess.

I was faced with an interesting challenge recently: an 'instigator' process was generating a large list of nearly identical strings (object names) and distributing them as a work list to a cloud of worker processes. The problem was that the number of objects had gone up by a factor of ten, and the sheer number of object names was causing the 32-bit instigator to run out of memory.

I toyed briefly with the idea of using string compression, but very rapidly came to the conclusion that this would only give us a 50% gain, which would soon be eaten by further growth. What we needed was to change the order of the problem entirely, which we did by changing the algorithm. Additionally we brought the run-time down by a factor of ten, which was just gravy.

However, when considering string compression I experimented with the existing bz2 and zlib modules and, like many others before me, found that they tended to make small strings much bigger. These algorithms are entropy coders, looking for symbols, blocks, and patterns; a small string doesn't give them enough data to understand it before it ends.
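To see the effect for yourself, here is a quick illustration using the standard library (exact sizes vary with library versions and compression levels, but both outputs come out larger than the input):

    import bz2
    import zlib

    s = b"the quick brown fox"          # 19 bytes of plain text
    print(len(s), len(zlib.compress(s)), len(bz2.compress(s)))
    # Both "compressed" forms are larger than the input: zlib adds a header
    # and checksum and finds no repeated patterns to exploit, while bz2's
    # block framing dominates tiny inputs entirely.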

I quickly found SMAZ, which seemed to promise good compression of small strings, while at the same time being straightforward enough to port to Python.

SMAZ uses a 253-entry dictionary of common substrings combined with two extra codes (a decoding sketch follows the list below):

  • 254 – indicates the next byte is a single unencoded byte
  • 255 – indicates the next byte is a length, followed by that many unencoded bytes
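A minimal decoder sketch for the scheme as described above (illustrative only: DECODE_TABLE is a stand-in for the real 253-entry dictionary, and PySmaz's actual handling of the length byte may differ in detail):

    DECODE_TABLE = [' ', 'the', 'e', 't', 'a']   # stand-in; 253 entries in reality

    def decompress_sketch(data):
        out = []
        i = 0
        while i < len(data):
            code = ord(data[i])
            if code == 254:                      # next byte is one raw character
                out.append(data[i + 1])
                i += 2
            elif code == 255:                    # next byte is a length, then that
                length = ord(data[i + 1])        # many raw characters
                out.append(data[i + 2:i + 2 + length])
                i += 2 + length
            else:                                # low codes index into the dictionary
                out.append(DECODE_TABLE[code])
                i += 1
        return ''.join(out)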

Getting Started

I decided to experiment with a pure Python implementation; this would allow something like PyPy (a JIT compiler for Python) to get optimal performance, while providing a very portable library.

My first attempt at the implementation was as direct a port of the C version as I could muster. Throughput in regular CPython was abysmal: barely 175 KB/s of compression. Like the cloud worker problem, I didn't need a simple 50% fix here, but an order-of-magnitude improvement that could only come from a different algorithm.

Improving Throughput

I started looking at the C algorithm in detail, and several things became apparent:

  • The inner loop considered, for each byte of the input, candidate matches of up to seven characters, longest first. A miss in the encoding table would only be discovered after checking all seven lengths, implying performance of at least O(n * 7).
  • At each candidate length an encoding table was searched for matches. The table was hashed into based on the character content of the input, and the matching entry was then found by a linear search of average length roughly 2, implying O(n * 7 * 2).

To improve throughput it would be necessary to reduce that constant factor of 7 * 2.

Immediately a tree felt like the right data structure to help with this. Throwing away the encoding table, I built a simple tree structure in which each node had:

  • A dict of child nodes, mapped by the byte value
  • An encoded value that represents the path to this point (if available) or None

In building the tree I created a make_tree function that builds an encoding tree from the decoding table, allowing easy substitution of input dictionaries.
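A sketch of the idea (the Node class and make_tree shown here are illustrative rather than PySmaz's exact internals):

    class Node(object):
        """One tree node: children keyed by the next character, plus the encoded
        byte value if the path from the root spells a complete dictionary entry."""
        def __init__(self):
            self.children = {}
            self.value = None

    def make_tree(decode_table):
        """Build the encoding tree (a trie) from the decoding table."""
        root = Node()
        for code, entry in enumerate(decode_table):
            node = root
            for ch in entry:
                node = node.children.setdefault(ch, Node())
            node.value = code          # the path from the root spells `entry`
        return root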

The encoding inner loop could now be a simple loop advancing forward character by character through the encoding tree, exiting as soon as a match was impossible. The improvement was dramatic, with throughput up to one megabyte per second. The new constant is a little hard to calculate, but the average depth of the tree (2.34) seems to tie out with the observed performance difference: 175 KB/s * 14 / 2.34 ~= 1 MB/s.
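The walk itself looks roughly like this (a sketch assuming the Node and make_tree shapes from the previous snippet; the raw-byte fallback via codes 254 and 255 is elided):

    def longest_match(tree, text, start):
        """Return (encoded_value, matched_length) for the longest dictionary
        entry starting at text[start], or (None, 0) on a miss."""
        node = tree
        best = (None, 0)
        i = start
        while i < len(text) and text[i] in node.children:
            node = node.children[text[i]]
            i += 1
            if node.value is not None:   # a complete dictionary entry ends here
                best = (node.value, i - start)
        return best

On a hit the encoder emits the value and advances by the matched length; on a miss it buffers the raw character and advances by one.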

Improving Compression

Now stepping back and looking at the algorithm: it greedily encodes characters as entries in the encoding table, but the cost of switching between raw encoding (codes 254 and 255) and table values sometimes results in a worse overall encoding. A simple example of this is the following string:

'@ @ @ '

Since @ is not in the encoding table, and space is, the algorithm ends up producing an encoding of:

[254-Single raw symbol][@][Encoded symbol ' '][254-Single raw symbol][@][Encoded symbol ' '][254-Single raw symbol][@][Encoded symbol ' ']

So instead of six bytes, we have nine.

We can improve this by recognizing that the alternative to encoding is to merely encapsulate the string thus:

[255-Raw symbol run][Length: 6][@][ ][@][ ][@][ ]

Which gives us eight bytes.

Since the worst-case growth (by encapsulation) for a string of a given length can be computed, it is trivial to add a check at the end of the encoding run to see whether the encoded size has exceeded that growth rate. I call this 'pathological case checking'.
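The check itself is cheap. A sketch follows; I assume here that a raw run covers at most 255 characters and costs two bytes of overhead, the exact constants depend on how the length byte is interpreted, and encapsulate_raw is a hypothetical helper rather than PySmaz's real function:

    def worst_case_size(n, max_run=255):
        """Size of an n-character string if it is simply wrapped in raw runs:
        two bytes of overhead (code 255 + length) per run."""
        runs = -(-n // max_run)          # ceiling division
        return n + 2 * runs

    def encapsulate_raw(s, max_run=255):
        """Hypothetical helper: wrap s as [255][length][raw chars], repeated."""
        return ''.join(chr(255) + chr(len(s[i:i + max_run])) + s[i:i + max_run]
                       for i in range(0, len(s), max_run))

    def finish(original, encoded):
        """Pathological case check: if greedy encoding did worse than plain
        encapsulation would, fall back to encapsulation."""
        if len(encoded) > worst_case_size(len(original)):
            return encapsulate_raw(original)
        return encoded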

However, this doesn’t stop pathological runs emerging inside larger strings where the overall compression ratio is better than worst case. To improve this, we must use a different strategy.

Since we know that the pathological case arises on transitions between encoded and raw runs, it was worth adding logic to validate that a transition makes sense. I considered using a path-finding algorithm to explore the compression state space; however, since there is no cheap way to calculate the cost of encoding a given portion of the string, it would require a great deal of re-encoding of the same material, pushing throughput down dramatically.

I settled on a simpler heuristic that can be summarized as follows. Keep a record of the position (a backtrack marker) at which the last transition between modes was made, and buffer the bytes encoded between there and the current position. The next time we hit a transition, compare the encoded length to the worst-case encapsulated length. Then:

  • Where the encoded length is clearly better, flush the encoded bytes to output and move the backtrack marker forward to the current position.
  • Where the gain is neutral or slight, hold the backtrack marker in place and continue forward; the gain or loss may become clearer at the next transition.
  • Where the encoded length is clearly worse, throw away the encoded bytes, place the raw bytes into an unmatched buffer for later encapsulation, and move the backtrack marker forward.

This seems to catch most of the pathological sub-string issues; however, it won't spot a few very small pathological cases, so the final pathological case check is still required.
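In code, the decision made at each transition boils down to something like this (a simplified sketch of the heuristic rather than PySmaz's actual implementation; the slack threshold and buffer handling are illustrative):

    def on_transition(encoded_chunk, raw_chunk, output, unmatched, slack=1):
        """Decide what to do with the bytes encoded since the backtrack marker.

        encoded_chunk -- table-encoded output produced since the marker
        raw_chunk     -- the original characters that output covers
        output        -- committed output (a list of strings)
        unmatched     -- raw characters waiting to be emitted as a 255-run
        Returns True if the backtrack marker should advance to the current position.
        """
        worst_case = len(raw_chunk) + 2          # cost of simply encapsulating them

        if len(encoded_chunk) < worst_case - slack:
            output.append(encoded_chunk)         # clearly better: commit and advance
            return True
        if len(encoded_chunk) <= worst_case + slack:
            return False                         # too close to call: hold the marker
        unmatched.append(raw_chunk)              # clearly worse: discard the encoding
        return True                              # and keep the raw characters for later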

In the real world the backtracking logic makes an improvement of approximately 1% to 8% to the compression of large strings, which is slightly useful.

Performance

SMAZ really shines on small strings: the setup costs of the other algorithms (in Python at least) dwarf SMAZ's, so its relative throughput is high, and massively exaggerated for small strings. Oddly, PyPy performance for small strings is actually higher than for longer strings; I suspect that list growth in PyPy becomes a limiting factor.

Strings From 1 to 8 Bytes

                     SMAZ (CPython)  SMAZ (PyPy)        bz2        zlib
   Comp   throughput       0.5 MB/s     4.0 MB/s   0.2 MB/s   0.43 MB/s
   Decomp throughput       1.4 MB/s    14.0 MB/s   0.5 MB/s    2.6 MB/s

Five Megabyte Strings

With five-megabyte strings, the gloves are really off. Zlib's performance is very impressive, and the gains of PyPy over CPython are clear to see. Clearly you wouldn't choose SMAZ for compressing anything over a few hundred bytes.

                     SMAZ (CPython)  SMAZ (PyPy)        bz2        zlib
   Comp   throughput       0.9 MB/s     2.0 MB/s   2.0 MB/s   74.0 MB/s
   Decomp throughput       3.6 MB/s    16.5 MB/s  30.3 MB/s  454.6 MB/s

Common Text-Compression Corpus

Again the small strings message comes through strongly. Individually compressed SMS messages and URLs really are the big win here.

                         Original   SMAZ**      bz2     zlib  SMAZc***  SMAZcp****
   NUS SMS Messages *     2666533  1851173  4106666  2667754   1876762     1864025
   alice29.txt             148481    91958    43102    53408     92405
   asyoulik.txt            125179    83762    39569    48778     84707
   cp.html                  24603    19210     7624     7940     19413
   fields.c                 11150     9511     3039     3115     10281
   grammar.lsp               3721     3284     1283     1222      3547
   lcet10.txt              419235   252085   107648   142604    254131
   plrabn12.txt            471162   283407   145545   193162    283515
   ACT corpus (concat)    4802130  3349766  1096139  1556366   3450138
   Leeds URL corpus *     4629439  3454264  7246436  5011830   3528446     3527606
   * Individually compressed (i.e. small strings)
   ** SMAZ with back-tracking
   *** SMAZ classic (original algorithm)
   **** SMAZ classic with pathological case detection

Making Friends and Influencing PyPy

One of the objectives of writing the project in pure Python was to ensure broad compatibility. Once I started testing with PyPy I immediately found an issue with my first implementation: a unit test that exercised a five-megabyte string took forever to execute. Since the implementation at that stage relied on string concatenation for most of its work, it effectively had O(n^2) performance in PyPy due to the O(n) cost of each string append. CPython, on the other hand, optimizes repeated in-place string concatenation to effectively O(1) per append, so I didn't see this issue there.

Resolving it was simply a question of switching from strings to lists of characters. A few tweaks were necessary to ensure that the lists only ever contained single characters, so that len() always returned the number of characters in the buffer.
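The pattern is the familiar Python one (a sketch; the real buffers in PySmaz carry more state than this):

    # Repeated string concatenation copies the growing string on PyPy, giving
    # O(n^2) behaviour overall:
    def build_slow(pieces):
        out = ''
        for piece in pieces:
            out += piece
        return out

    # Accumulating single characters in a list keeps appends cheap, and because
    # every element is exactly one character, len(out) is also the output length:
    def build_fast(pieces):
        out = []
        for piece in pieces:
            out.extend(piece)        # extend() splits a string into characters
        return ''.join(out)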

Happily, the change resulted in a small extra performance boost in CPython too.

Real World Usage

I would recommend SMAZ for compression if you have something like an in-memory database where you want random access to a large number of strings. The alternative might be to use multi-kilobyte blocks of strings compressed with zlib, LZ4 or Snappy, and simply store pointers into the uncompressed block.

Analyzing the relative performance of the two approaches, assume 50-byte strings and that PyPy is being used. SMAZ's small-string decompression throughput of 14 MB/s then gives random access to around 280K strings per second. Zlib performance will depend on block size, since each lookup must decompress the whole block:

   
   Text block size                       2 KB   4 KB   8 KB  16 KB
   Random access strings per second*:    232k   116k    58k    29k

*Calculated performance - assumes peak Zlib performance which may not be the case
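For what it's worth, the arithmetic behind those estimates is just throughput divided by the amount of data you must decompress per lookup (a sketch using the figures measured earlier; the exact results depend on the throughput figure and block-size convention used, so they will not match the table to the last digit):

    # Rough model: a random lookup must decompress the whole block containing
    # the string, so lookups/sec ~= decompression throughput / block size.
    ZLIB_DECOMP = 454.6e6    # bytes/sec, measured above on 5 MB inputs
    SMAZ_DECOMP = 14.0e6     # bytes/sec, SMAZ on PyPy with small strings
    AVG_STRING = 50          # assumed average string length in bytes

    print('SMAZ: ~%dK strings/sec' % (SMAZ_DECOMP / AVG_STRING / 1000))
    for block in (2048, 4096, 8192, 16384):
        print('zlib, %5d byte blocks: ~%dK strings/sec'
              % (block, ZLIB_DECOMP / block / 1000))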

I suspect 1 KB blocks would result in too much setup cost, so SMAZ (or something similar) seems a sensible thing to reach for in this situation.

I had a quick look at LZ4, which has blistering throughput (seriously, almost unbelievable) even with small strings, but on 50-character strings it resulted in 10% growth on average, so it's not really applicable to small strings. Unfortunately I suspect that block-based random access hurts it just as it does zlib, though less severely.

I tried to get Snappy to work but failed. I'm sure that with a bit of poking around I could make it happen (Python binding issues), but for the moment I am going to assume that its performance is comparable to LZ4.

C-Bindings

Of course directly binding Python to the original library gets you even better performance. Unfortunately my experiments with the existing 'smaz' Python bindings were a bit problematic (buggy): longer strings appeared to be truncated, while shorter strings appeared to lose trailing characters. If we ignore these problems, throughput for the C library was 20x that of CPython and approximately 10x that of PyPy.

Of course the original C-library would benefit from many of the performance improvements mentioned above.

Conclusions

I believe SMAZ represents a very interesting choice when looking for a library to compress individual strings in in-memory datasets that require very fine-grained random access.

I greatly enjoyed the exercise of porting this small library. In general it reinforced the following lessons:

  • The choice of algorithm is where the greatest performance gains will come from
  • Favor built-in functions in Python as much as possible i.e. len(list) over using a counter
  • The fewer the lines of Python, the faster
  • Unit-tests. Nuff’ said.

The PySmaz package is available on PyPI, the Python Package Index, as well as on GitHub.