This commit splits the length calculation of the resulting string and the
actual encoding into two new functions. This makes it possible to e.g.
encode utf8 directly into a buffer.
The length calculation has been rewritten with one version per shift
size. For 8-bit strings it uses a popcount loop, which counts the high
bits (one per code point above 0x7f) in machine-word-sized chunks. On
machines which have a popcount instruction this is much faster.
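To illustrate the idea, here is a minimal sketch of such a popcount
loop. It is not the code from this commit: the function name
utf8_len_8bit, the fixed 64-bit chunk size and the bytewise fallback
are assumptions made for the example. Every 8-bit code point needs one
byte, and each one with the high bit set needs one more.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: UTF-8 length of an 8-bit string of n code points. */
static size_t utf8_len_8bit(const unsigned char *s, size_t n)
{
    size_t len = n;   /* every code point needs at least one byte */
    size_t i = 0;

#if defined(__GNUC__) || defined(__clang__)
    /* Mask selecting the high bit of each of the 8 bytes in a chunk. */
    const uint64_t high = UINT64_C(0x8080808080808080);

    for (; i + sizeof(uint64_t) <= n; i += sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w);   /* alignment-safe load */
        len += (size_t)__builtin_popcountll(w & high);
    }
#endif

    /* Remaining tail bytes; also the whole input without the builtin
     * (an assumed fallback, chosen for this sketch). */
    for (; i < n; i++)
        len += s[i] >> 7;

    return len;
}
```

The mask works the same on little and big endian machines, since only
the number of set high bits matters and not their position.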
With compilers which do not support __builtin_popcount it uses a simple
For 16-bit and 32-bit strings the length calculation uses clz to count
the number of significant bits in each code point, which gives the
encoded length without branches.
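A minimal sketch of the clz approach, not the code from this commit.
The names utf8_char_len and utf8_len_16bit are made up for the
example, and it assumes gcc or clang, a 32-bit unsigned int, and code
points no larger than 0x10FFFF (surrogates get no special treatment).
A code point with up to 7 significant bits takes 1 byte, up to 11 bits
2 bytes, up to 16 bits 3 bytes, and anything larger 4 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: UTF-8 byte length of one code point. */
static unsigned utf8_char_len(uint32_t c)
{
    /* c | 1 avoids the undefined __builtin_clz(0) and still yields 1 bit. */
    unsigned bits = 32u - (unsigned)__builtin_clz(c | 1u);

    /* Each comparison evaluates to 0 or 1; compilers typically emit
     * these as flag-setting instructions rather than branches. */
    return 1u + (bits > 7) + (bits > 11) + (bits > 16);
}

/* Hypothetical helper: UTF-8 length of a 16-bit string. */
static size_t utf8_len_16bit(const uint16_t *s, size_t n)
{
    size_t len = 0;

    for (size_t i = 0; i < n; i++)
        len += utf8_char_len(s[i]);
    return len;
}
```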
The encoding function is split into one version for each shift size.
For 32-bit strings it avoids branches by using the resulting byte
length as a jump size. This generates reasonable code, at least with gcc.
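One way to read "byte length as a jump size" is a switch on the length
whose cases fall through, so the length selects the entry point and
the continuation bytes are written without further comparisons. The
sketch below shows that reading only. It is an assumption, not the
code from this commit, and the names cp_len, put_cp and encode_32bit
are hypothetical. It handles code points up to 0x10FFFF and writes
into a buffer the caller has sized with a separate length pass.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte length of one code point (<= 0x10FFFF). */
static unsigned cp_len(uint32_t c)
{
    return 1u + (c > 0x7F) + (c > 0x7FF) + (c > 0xFFFF);
}

/* Write one code point of known length; returns the new write position. */
static unsigned char *put_cp(unsigned char *dst, uint32_t c, unsigned len)
{
    /* Prefix bits of the first byte, indexed by length (index 0 unused). */
    static const unsigned char lead[5] = { 0x00, 0x00, 0xC0, 0xE0, 0xF0 };
    unsigned char *p = dst + len;   /* fill backwards from the end */

    switch (len) {
    case 4: *--p = (unsigned char)(0x80 | (c & 0x3F)); c >>= 6; /* fall through */
    case 3: *--p = (unsigned char)(0x80 | (c & 0x3F)); c >>= 6; /* fall through */
    case 2: *--p = (unsigned char)(0x80 | (c & 0x3F)); c >>= 6; /* fall through */
    case 1: *--p = (unsigned char)(lead[len] | c);
    }
    return dst + len;
}

/* Encode n code points into dst; returns the number of bytes written. */
static size_t encode_32bit(unsigned char *dst, const uint32_t *s, size_t n)
{
    unsigned char *p = dst;

    for (size_t i = 0; i < n; i++)
        p = put_cp(p, s[i], cp_len(s[i]));
    return (size_t)(p - dst);
}
```

Because the encoder only needs a destination pointer, the same
function can write into a freshly allocated string or into the middle
of an existing buffer.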
Benchmark results on my i7:
utf8/code.pike#encode_7bit      |   1.3 G 1.6 % |   8.3 G 3.4 % |
utf8/code.pike#encode_8bit      | 651.1 M 1.8 % |   1.1 G 1.2 % |
utf8/code.pike#encode_arabic    | 498.4 M 0.8 % | 710.3 M 1.2 % |
utf8/code.pike#encode_bulgarian | 488.2 M 1.2 % | 688.4 M 2.6 % |
utf8/code.pike#encode_estonian  | 614.8 M 6.6 % | 969.5 M 1.5 % |
utf8/code.pike#encode_hebrew    | 496.9 M 1.8 % | 710.1 M 1.0 % |
utf8/code.pike#encode_japanese  | 704.9 M 4.0 % | 785.4 M 1.6 % |
utf8/code.pike#encode_polish    | 388.9 M 0.4 % | 710.1 M 1.3 % |
utf8/code.pike#encode_thai      | 642.8 M 3.3 % | 858.0 M 0.9 % |
utf8/code.pike#encode_yiddish   | 485.9 M 3.3 % | 692.5 M 3.8 % |
I also tested on arm32, where the speedups are around 50%.