[performance] do not assume that `== can modify call_outs
The call_out data structures in backends are protected against reentrance,
so we can safely assume that `== never modifies the call_out hash
table. This change makes backend_find_call_out significantly faster and
speeds up the corresponding call_out handling benchmark by 25%. It is
still _much_ slower than in 7.8, probably because call_outs are objects
in 7.9, which introduces quite some overhead.
[performance] Some tweaks to stralloc to improve performance
Increased the hash-size significantly.
It now aims for one string per bucket instead of 4.
Changed to only have one short_string block allocator. The wide short
strings are simply fewer characters long now.
Also, do not re-order the chains in findstring.
Update the flags in realloc_shared_string so the code does not have to
be duplicated in the two places the function is used.
Changed the switch to if/else in low_set_index.
This made that function about 3x faster, at least when setting indices
in narrow strings (the case that is now first, and was previously last
in the if/else gcc generated).
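
To illustrate the if/else ordering described above, here is a minimal
sketch that tests the narrow (8-bit) case first. The struct and function
names (demo_string, demo_set_index) are hypothetical stand-ins and are not
taken from stralloc; the size_shift encoding is an assumption made for the
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified string with 8-, 16- or 32-bit characters. */
struct demo_string {
    int size_shift;   /* assumed encoding: 0 = 8-bit, 1 = 16-bit, 2 = 32-bit */
    size_t len;       /* length in characters */
    void *data;       /* character storage */
};

/* Store value at position pos. The narrow case is tested first, so the
 * most common kind of string takes the shortest path through the chain. */
static void demo_set_index(struct demo_string *s, size_t pos, uint32_t value)
{
    assert(pos < s->len);
    if (s->size_shift == 0) {
        ((uint8_t *)s->data)[pos] = (uint8_t)value;
    } else if (s->size_shift == 1) {
        ((uint16_t *)s->data)[pos] = (uint16_t)value;
    } else {
        ((uint32_t *)s->data)[pos] = value;
    }
}
```

Whether this beats a switch depends on the compiler and on how it orders
the generated tests, so any gain should be measured rather than assumed.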
[performance] Fixed the local+local and some other opcodes
They now use destructive operations when possible.
Also added an inline version of string+string to the local+=local
[performance] Unroll the crc32si, and only xor once.
This more than doubled the hashing speed, but makes even more
assumptions about how the function is called.
- pike/src/apply_low.h (+5/-10)(15 lines)
- pike/src/interpret.c (+185/-15)(200 lines)
- pike/src/interpret.h (+6/-2)(8 lines)
- pike/src/interpret_functions.h (+102/-22)(124 lines)
- pike/src/pike_macros.h (+12/-0)(12 lines)
[performance] Slightly smaller low_mega_apply.
There is now only one instance of the inclusion of apply_low.h (does
anyone else feel a need for less lowness around here?) which actually
made it faster (previously there were two cases, one for scoped
function calls and one for calls without scope).
Also created a new version of mega_apply (named lower_mega_apply) that
can only do APPLY_LOW, and only the most common cases.
It falls back to using the old low_mega_apply when needed.
Added a lot of lowness to places to utilize the optimization.
This actually saves surprising amounts of CPU in code calling a lot of
small functions; lower_mega_apply() is about 2x faster than
low_mega_apply(APPLY_LOW,...) when it does not hit one of the cases it
does not support (trampolines, calling non-function constants,
variables or arrays).
There is unfortunately now even more code duplication around, but
since the code is slightly different that is rather hard to avoid.
- pike/src/interpret.c (+167/-63)(230 lines)
- pike/src/interpret.h (+10/-3)(13 lines)
- pike/src/pike_embed.c (+3/-2)(5 lines)
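
The fast-path/fallback split described for lower_mega_apply can be sketched
roughly as follows. This is not the interpreter's code: every name here
(demo_callable, demo_fast_apply, demo_general_apply, demo_apply) is
hypothetical, and the kinds of callables are reduced to a small enum for
the sake of the example.

```c
#include <assert.h>

/* Hypothetical kinds of things that can be called. */
enum demo_kind { DEMO_PLAIN_FUNCTION, DEMO_TRAMPOLINE, DEMO_CONSTANT, DEMO_VARIABLE };

struct demo_callable {
    enum demo_kind kind;
    int (*fn)(int);
};

/* Counts how often the general path ran, for demonstration only. */
static int demo_slow_calls = 0;

/* General path: knows about every kind, at the cost of more dispatch. */
static int demo_general_apply(const struct demo_callable *c, int arg)
{
    demo_slow_calls++;
    switch (c->kind) {
    case DEMO_PLAIN_FUNCTION:
    case DEMO_TRAMPOLINE:
        return c->fn(arg);
    default:
        return 0;   /* placeholder for the remaining cases */
    }
}

/* Fast path: handles only the most common case. Returns 1 and stores the
 * result if it handled the call, 0 if the caller must fall back. */
static int demo_fast_apply(const struct demo_callable *c, int arg, int *result)
{
    if (c->kind != DEMO_PLAIN_FUNCTION)
        return 0;
    *result = c->fn(arg);
    return 1;
}

/* Entry point: try the fast path, fall back to the general one. */
static int demo_apply(const struct demo_callable *c, int arg)
{
    int result;
    if (demo_fast_apply(c, arg, &result))
        return result;
    return demo_general_apply(c, arg);
}

/* Example target function. */
static int demo_double(int x) { return 2 * x; }
```

The two paths necessarily repeat some logic, which matches the note above
about code duplication being hard to avoid.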
[performance] Do not use block-alloc for pike_frame and catch_context
They are too important for code execution speed.
struct pike_frames are allocated in chunks but not freed until the
program exits. This is basically just like the normal stack, and for
all but the most extreme of recursive programs this is not really an
issue. And for those programs the only loss now is that we are not
returning the frame memory to the system, we are actually using less
memory at peak.
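
A minimal sketch of chunked allocation with a free list that is never
handed back to the system, along the lines described above. The names
(demo_frame, demo_alloc_frame, demo_release_frame) and the chunk size of 64
are assumptions for illustration only, not the values used in the
interpreter.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_FRAMES_PER_CHUNK 64   /* assumed chunk size, not from the source */

struct demo_frame {
    struct demo_frame *next;   /* link in the free list */
    int payload;               /* placeholder for the real frame fields */
};

static struct demo_frame *demo_free_frames = NULL;

/* Get a frame. When the free list is empty, allocate a whole chunk and
 * thread all of its frames onto the free list. */
static struct demo_frame *demo_alloc_frame(void)
{
    struct demo_frame *f;
    if (!demo_free_frames) {
        struct demo_frame *chunk = malloc(sizeof(*chunk) * DEMO_FRAMES_PER_CHUNK);
        if (!chunk)
            return NULL;
        for (int i = 0; i < DEMO_FRAMES_PER_CHUNK; i++) {
            chunk[i].next = demo_free_frames;
            demo_free_frames = &chunk[i];
        }
    }
    f = demo_free_frames;
    demo_free_frames = f->next;
    f->next = NULL;
    return f;
}

/* Put a frame back on the free list. Chunks are never passed to free(),
 * so the memory stays with the process until it exits. */
static void demo_release_frame(struct demo_frame *f)
{
    assert(f != NULL);
    f->next = demo_free_frames;
    demo_free_frames = f;
}
```

Allocation and release are each a couple of pointer operations in the
common case, which is the point of the scheme.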
The catch_context structures (that are fairly large, anyway, 80 bytes
on my machine) are simply allocated using malloc, and up to 100 free
ones are kept in a list for quick use.
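
The malloc-plus-bounded-free-list approach for catch_context could look
roughly like this. The cap of 100 comes from the description above; the
struct layout and the names (demo_catch_context, demo_alloc_context,
demo_free_context) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_MAX_FREE 100   /* at most this many free contexts are cached */

struct demo_catch_context {
    struct demo_catch_context *next;   /* link in the free list */
    char payload[72];                  /* placeholder for the real fields */
};

static struct demo_catch_context *demo_free_list = NULL;
static int demo_num_free = 0;

/* Reuse a cached context if there is one, otherwise fall back to malloc. */
static struct demo_catch_context *demo_alloc_context(void)
{
    if (demo_free_list) {
        struct demo_catch_context *c = demo_free_list;
        demo_free_list = c->next;
        demo_num_free--;
        return c;
    }
    return malloc(sizeof(struct demo_catch_context));
}

/* Cache the context for quick reuse, unless the cache is already full. */
static void demo_free_context(struct demo_catch_context *c)
{
    assert(c != NULL);
    if (demo_num_free < DEMO_MAX_FREE) {
        c->next = demo_free_list;
        demo_free_list = c;
        demo_num_free++;
    } else {
        free(c);
    }
}
```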
[performance] Use the hashtable more when indexing objects
Now it is used even if there is only one identifier in the object.
That helps more than it should, really.
[performance] When setting object variables, only check svalue_is_zero if needed
This speeds up assignment of a lot of object variable types,
svalue_is_zero is (comparatively) expensive.
[performance] Speed up the low_return function noticeably.
This code reordering/redundant test removal makes the function about 10% faster.
[performance] Significantly faster is_lt and svalue_is_true
The is_lt function now uses no stack at all, which speeds it up by about
a factor of ten (for the case where both arguments are integers).
Much the same was done for svalue_is_true.
Also, the order of the tests was rearranged to get some other
Interestingly enough is_lt is actually faster even for the complex
cases now, for whatever reason gcc seems to generate better code.
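
As a rough illustration of testing the integer/integer case first in a
comparison such as is_lt, consider the sketch below. The value
representation and the names (demo_value, demo_is_lt, demo_is_lt_slow) are
hypothetical and much simpler than real svalues; whether the fast path
ends up using no stack depends on the compiler.

```c
#include <assert.h>

enum demo_type { DEMO_INT, DEMO_FLOAT };

/* Hypothetical tagged value holding either an integer or a float. */
struct demo_value {
    enum demo_type type;
    union { long i; double f; } u;
};

/* General comparison for mixed or non-integer operands. */
static int demo_is_lt_slow(const struct demo_value *a, const struct demo_value *b)
{
    double x = (a->type == DEMO_INT) ? (double)a->u.i : a->u.f;
    double y = (b->type == DEMO_INT) ? (double)b->u.i : b->u.f;
    return x < y;
}

/* Check the common integer/integer case first with a single comparison
 * and no temporaries, then defer everything else to the slow path. */
static int demo_is_lt(const struct demo_value *a, const struct demo_value *b)
{
    if (a->type == DEMO_INT && b->type == DEMO_INT)
        return a->u.i < b->u.i;
    return demo_is_lt_slow(a, b);
}
```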