Multi-cpu support in Pike
This is a draft spec for how to implement multi-cpu support in Pike.
The intention is that it gets extended along the way as more issues
gets ironed out. Discussions take place in "Pike dev" in LysKOM or
Initial draft created 8 Nov 2008 by Martin Stjernholm.
Background and goals
Pike supports multiple threads, but like many other high-level
languages it only allows one thread at a time to access the data
structures. This means that the utilization of multi-cpu and
multi-core systems remains low, even though there are some modules
that can do isolated computational tasks in parallell (e.g. the Image
It is the so-called "interpreter lock" that must be locked to access
any reference variable (i.e. everything except floats and native
integers). This lock is held by default in essentially all C code and
is explicitly unlocked in a region by the THREADS_ALLOW/
THREADS_DISALLOW macros. On the pike level, the lock is always held -
no pike variable can be accessed and no pike function can be called
The purpose of the multi-cpu support is to rectify this. The design
goals are, in order of importance:
1. Pike threads should be able to execute pike code concurrently on
multiple cpus as long as they only modify thread local pike data
and read a shared pool of static data (i.e. the pike programs,
modules and constants).
2. There should be as few internal hot spots as possible (preferably
none) when pike code is executed concurrently. Care must be taken
to avoid internal synchronization, or updates of shared data that
would cause "cache line ping-pong" between cpus.
3. The concurrency should be transparent on the pike level. Pike code
should still be able to access shared data without locking and
without risking low-level inconsistencies. (So Thread.Mutex etc
would still be necessary to achieve higher level synchronization.)
4. There should be tools on the pike level to allow further
performance tuning, e.g. lock-free queues, concurrent access hash
tables, and the possibility to lock different regions of shared
data separately. These tools should be designed so that they are
easy to slot into existing code with few changes.
5. There should be tools to monitor and debug concurrency. It should
be possible to make assertions that certain objects aren't shared,
and that certain access patterns don't cause thread
synchronization. This is especially important if goal (3) is
realized, since the pike code by itself won't show what is shared
and what is thread local.
6. C modules should continue to work without source level
modification (but likely without allowing any kind of
Note that even if goal (3) is accomplished, this is no miracle cure
that would make all multithreaded pike programs run with optimal
efficiency on multiple cpus. One could expect better concurrency in
old code without adaptions, but it could still be hampered
considerably by e.g. frequent updates to shared data. Concurrency is a
problem that must be taken into account on all levels.
Perl: All data is thread local by default. Data can be explicitly
shared, in which case Perl ensures internal consistency. Every shared
variable is apparently locked individually. Referencing a thread local
variable from a shared one causes the thread to die. See perthrtut(1).
Python: Afaik it's the same state of affairs as Pike.
The basic approach is to divide all data into thread local and shared:
o Thread local data is everything that is accessible to one thread
only, i.e. there are no references to anything in it from shared
data or from any other thread. This is typically data that the
current thread has created itself and only reference from the
stack. The thread can access its local data without locking.
o Shared data is everything that is accessible from more than one
thread. Access to it is synchronized using a global read/write
lock, the so-called "global lock". I.e. this lock can either be
locked for reading by many threads, or be locked by a single thread
for writing. Locking the global lock for writing is the same as
locking the interpreter lock in current pikes. (This single lock is
refined later - see issue "Lock spaces".)
o There is also a special case where data can be "disowned", i.e. not
shared and not local in any thread. This is used in e.g.
Thread.Queue for the objects that are in transit between threads.
Disowned data cannot have arbitrary references to it - it must
always be under the control of some object that in some way ensures
consistency. (Garbage could be made disowned since it by definition
no longer is accessible from anywhere, but of course it is always
better to clean it up instead.)
+--------+ +---------------------+ Direct +--------+
| |<-- refs --| Thread 1 local data |<- - access - -| |
| | +---------------------+ | Thread |
| | | 1 |
| |<- - - - Access through global lock only - - - -| |
| Shared | +--------+
| data | +---------------------+ Direct +--------+
| |<-- refs --| Thread 2 local data |<- - access - -| |
| | +---------------------+ | Thread |
| | | 2 |
| |<- - - - Access through global lock only - - - -| |
| | +--------+
+--------+ ... etc ...
The principal use case for this model is that threads can do most of
their work with local data and read access to the shared data, and
comparatively seldom require the global write lock to update the
shared data. Every shared thing does not have its own lock since that
would cause excessive lock overhead.
Note that the shared data is typically the same as the data referenced
from the common environment (i.e. the "global data").
Also note that the current object (this) always is shared in pike
modules, so a thread cannot assume free access to it. In other pike
classes it would often be shared too, but it is still important to
utilize the situation when it is thread local. See issue "Function
A thread local thing, and all the things it references directly or
indirectly, automatically becomes shared whenever it gets referenced
from a shared thing.
A shared thing never automatically becomes thread local, but there is
a function to explicitly "take" it. It would first have to make sure
there are no references to it from shared or other thread local things
(c.f. issue "Moving things between lock spaces"). Thread.Queue has a
special case so that if a thread local thing with no other refs is
enqueued, it is disowned by the current thread, and later becomes
thread local in the thread that dequeues it.
Issue: Lock spaces
Having a single global read/write lock for all shared data could
become a bottleneck. Thus there is a need for shared data with locks
separate from the global lock. Things that share a common lock is
called a "lock space", and it is always possible to look up the lock
that governs any given thing (see issue "Memory object structure").
A special global lock space, which corresponds to the shared data
discussed above, is created on startup. All others have to be created
The intended use case for lock spaces is a "moderately large"
collection of things: Too large and you get outlocking problems, too
small and the lock overhead (both execution- and memorywise) gets
prohibiting. A typical lock space could be a RAM cache consisting of a
mapping and all its content.
Many different varieties of lock space locks can be considered, e.g. a
simple exclusive access mutex lock or a read/write lock, priority
locks, locks that ensure fairness, etc. Therefore different (C-level)
implementations should be allowed.
One important characteristic of lock space locks is whether they are
implicit or explicit:
Implicit locks are locked internally, without intervention on the pike
level. The lock duration is unspecified; locks are only acquired to
ensure internal consistency. All low level data access functions check
whether the lock space for the accessed thing is locked already. If it
isn't then the lock is acquired automatically. All implicit locks have
a well defined lock order (by pointer comparison), and since they only
are taken to guarantee internal consistency, an access function can
always release a lock to ensure correct order (see also issue "Lock
Explicit locks are exposed to the pike level and must be locked in a
similar way to Thread.Mutex. If a low level data access function
encounters an explicit lock that isn't locked, it throws an error.
Thus it is left to the pike programmer to avoid deadlocks, but the
pike core won't cause any by itself. Since the pike core keeps track
which lock governs which thing it ensures that no lock violating
access occurs, which is a valuable aid to ensure correctness.
One can also consider a variant with a read/write lock space lock that
is implicit for read but explicit for write, thus combining atomic
pike-level updates with the convenience of implicit locking for read
The scope of a lock space lock is (at least) the state inside all the
things it contains, but not the set of things itself, i.e. things
might be added to a lock space without holding a write lock. Removing
a thing from a lock space always requires the write lock since that is
necessary to ensure that a lock actually governs a thing for as long
as it is held (regardless it's for reading or writing).
See also issues "Memory object structure" and "Lock space locking" for
Issue: Garbage collector
Pike has used refcounting to collect noncyclic structures, combined
with a stop-the-world periodical collector for cyclic structures. The
periodic pauses are already a problem, and it only gets worse as the
heap size and number of concurrent threads increase. Since the gc
needs an overhaul anyway, it makes sense to replace it with a more
PHD-200-10.ps [FIXME: ref] is a recent thesis work that combines
several state-of-the-art gc algorithms to an efficient whole: It
describes a generational collector that uses deferred-update
refcounting for old things with on-the-fly collection, and on-the-fly
mark-and-sweep for young things. An on-the-fly cycle detector is also
employed for the refcounted area. See said work for rationale and
Effects of using this in Pike:
a. References from the C or pike stacks don't need any handling at
all (see also issue "Garbage collection and external references").
b. Special code is used to update refs in the heap. During certain
circumstances, before changing a pointer inside a thing which can
point to another thing, the state of all non-NULL pointers in it
are copied to a thread local log.
c. A new LogPointer field is required per thing. If a state copy has
taken place as described above, it points to the log that contains
the original pointer state of the thing.
Data containers that can be of arbitrary size (i.e. arrays,
mappings and multisets) should be segmented into fixed-sized
chunks with one LogPointer each, so that the state copy doesn't
get arbitrarily large.
d. The double-linked lists aren't needed. Hence two pointers less per
e. The refcounter word is changed to hold both normal refcount, weak
count(?), and flags. Overflowed counts are stored in a separate
f. The collector typically runs concurrently with the rest of the
program. It sometimes interrupts the other threads for handshakes.
These interrupts are not aligned with the evaluator callback
calls, since that would cause too much pausing of the collector
thread. This requires that threads can be stopped and resumed
externally. FIXME: Verify this in pthreads and on windows.
g. All garbage collection, both for noncyclic and cyclic garbage, are
discovered and handled by the gc thread. The other threads never
frees any block known to the gc.
g. An effect of the above is that all garbage is discovered by a
separate collector thread which doesn't execute any other pike
code. This opens up the issue on how to call destruct functions.
At least thread local things should reasonably get their destruct
calls in that thread. A problem is however what to do when that
thread has exited or emigrated (see issue "Foreign thread
For shared things it's not clear which thread should call destruct
anyway, so in that case any thread could do it. It might however
be a good idea to not do it directly in the gc thread, since doing
so would require that thread too to be a proper pike thread with
pike stack etc; it seems better to keep it an "invisible"
low-level thread outside the "worker" threads. In programs with a
"backend thread" it could be useful to allow the gc thread wake up
the backend thread to let it execute the destruct calls.
h. The most bothersome problem is that things are no longer freed
right away when running out of refs. This behavior in Pike is used
implicitly in many places, mainly to release locks timely by just
putting them in a local variable that gets freed when the function
exits (either by normal return or by exception).
Maybe a solution can be devised to keep this characteristic in
that special case, i.e. when a thread local thing only got a
single reference from the stack. This should be easy to detect in
the compiler: It's an assignment to a local variable that never
gets referenced. That's already a warning, but it is currently
tuned down to not warn in these cases (precisely to allow this
So the compiler could in such cases add implicit destruct calls on
function exit. Consider however if someone adds e.g. a werror call
to print out a description of the MutexKey object. The local
variable is referenced, but one won't expect that the innocent
werror() would change the freeing of the MutexKey. It's therefore
probably better to strengthen the compiler warning and require
people to deal with it on the Pike level (a werror for debug
purposes is unlikely to be there permanently, at least).
Question: Are there more cases where pike programmers expect
i. FIXME: How to solve weak refs?
j. One might consider separating the refcounts from the things by
using a hash table. This makes sense when considering that only
the collector thread is using the refcounts, thereby avoiding
false aliasing occurring from refcounter updates (and other gc
related flags) by that thread.
All the hash table lookups would however incur a significant
overhead in the gc thread. A better alternative would be to use a
bitmap based on the possible allocation slots used by the malloc
implementation, but that would require very tight integration with
the malloc system. The bitmap could work with only two bits per
refcounter - research shows that most objects in a refcounted heap
have very few refs. Overflowing (a.k.a. "stuck") refcounters at 3
would then be stored in a hash table.
k. FIXME: Is the third NOP handshake really necessary?
To simplify memory handling, the gc should be used consistently on all
heap structs, regardless whether they are pike visible things or not.
An interesting question is whether the type info for every struct
(more concretely, the address of some area where the gc can find the
functions it needs to handle the struct) is carried in the struct
itself (through a new pointer field), or if it continues to be carried
in the context for every pointer to the struct (e.g. in the type field
Issue: Memory object structure
Of concern are the memory objects known to the gc. They are called
"things", to avoid confusion with "objects" which are the structs for
There are two types of things:
o First class things with gc header and lock space pointer. Most pike
visible types are first class things. The exceptions are ints and
floats, which are passed by value.
o Second class things contain only a gc header. They are similar to
first class except that their lock spaces are implicit from the
referencing things, which means all those referencing things must
always be in the same lock space.
Thread local things could have NULL as lock space pointer, but as a
debug measure they could also point to the thread object so that it's
possible to detect bugs with a thread accessing things local to
Before the multi-cpu architecture, there are global double-linked
lists for each referenced pike type: array, mapping, multiset, object,
and program (strings and types are handled differently). Thanks to the
new gc, the double-linked lists aren't needed at all anymore.
| Thread 1 | | Thread 2 |
: refs O : : O O :
,----- O <--> O : ,------- O O ------.
| : O O -----. | : O O : |
| :............: | | :............: |
ref | | ref | ref | ref
| | | |
.|.............. ..v.......v..... refs ..............|.
: | refs : ref : O O O <------> O O v :
: v O <---> O ------------> O O : : O O :
: O O O O : : O O O : : O O O :
+--------------+ +--------------+ +--------------+
| Lock space 1 | | Lock space 2 | | Lock space 3 |
+--------------+ +--------------+ +--------------+
This figure tries to show some threads and lock spaces, and their
associated things as O's inside the dotted areas. Some examples of
possible references between things are included: Thread local things
can only reference things belonging to the same thread or things in
any lock space, while things in lock spaces can reference things in
the same or other lock spaces. There can be cyclic structures that
span lock spaces.
The lock space lock structs are tracked by the gc just like anything
else, and they are therefore garbage collected when they become empty
and unreferenced. The gc won't free a lock space lock struct that is
locked since it always got at least one reference from the array of
locked locks that each thread maintains (c.f. issue "Lock space
Issue: Lock space lock semantics
There are three types of locks:
o A read-safe lock ensures only that the data is consistent, not that
it stays constant. This allows lock-free updates in things where
possible (which could include arrays, mappings, and maybe even
multisets and objects of selected classes).
o A read-constant lock ensures both consistency and constantness
(i.e. what usually is assumed for a read-only lock).
o A write lock ensures complete exclusive access. The owning thread
can modify the data, and it can assume no other changes occur to it
(barring refcounters - see below). The owning thread can also under
limited time leave the data in inconsistent state. This is however
still limited by the calls to check_threads(), which means that the
state must be consistent again every time the evaluator callbacks
are run. See issue "Emulating the interpreter lock".
Allowing lock-free updates is attractive, so the standard read/write
lock that governs the global lock space will probably be multiple
An exception to the lock semantics above are refcounters or any other
fields used by the gc (the gc typically runs concurrently in a thread
of its own, and it doesn't heed any locks - see issue "Garbage
collector"). A ref to a thing can always be added or removed, even if
another thread holds an exclusive write lock on it. That since the
thing will only be freed by the gc, which won't free it if a ref is
Issue: Lock space locking
This is the locking procedure to access a thing:
1. Read the lock space pointer. If it's NULL then the thing is thread
local and nothing more needs to be done.
2. Address an array containing the pointers to the lock spaces that
are already locked by the thread.
3. Search for the lock space pointer in the array. If present then
nothing more needs to be done.
4. Lock the lock space lock as appropriate. Note that this can imply
other implicit locks that are held are unlocked to ensure correct
lock order (see issue "Lock spaces"). Then it's added to the
A thread typically won't hold more than a few locks at any time (less
than ten or so), so a plain array and linear search should perform
well. For quickest possible access the array should be a static thread
local variable (c.f. issue "Thread local storage"). If the array gets
full, implicit locks in it can be released automatically to make
space. Still, a system where more arrays can be allocated and chained
on would perhaps be prudent to avoid the theoretical possibility of
running out of space for locked locks.
Since implicit locks can be released (almost) at will, they are open
for performance tuning: Too long lock durations and they'll outlock
other threads, too short and the locking overhead becomes more
significant. As a starting point, it seems reasonable to release them
at every evaluator callback call (i.e. at approximately every pike
function call and return).
Issue: Moving things between lock spaces
Things can be moved between lock spaces, or be made thread local or
disowned. In all these cases, one or more things are given explicitly.
It's natural if not only those things are moved, but also all other
things in the same source lock space that are referenced from the
given things and not from anywhere else (this operation is the same as
Pike.count_memory does). In the case of making things thread local or
disowned, it is also necessary to check that the explicitly given
things aren't referenced from elsewhere.
FIXME: This is a problem with the proposed garbage collector (see
issue "Garbage collector"). Old things got refcounts that can be used,
but they might be stale, and the logging doesn't provide information
in the form we need. New things are even worse since they got no
refcounts at all that can be used to check for outside refs.
Furthermore, there is a race since an external ref can be added at any
time from any thread.
All this is settled when the gc is run: If the "controlled" refs are
temporarily ignored then the set to move is the one that would turn
into garbage. But it is not good to either have to wait for the gc or
run it synchronously.
Also, the problem above applies to Pike.count_memory too.
Strings are unique in Pike. This property is hard to keep if threads
have local string pools, since a thread local string might become
shared at any moment, and thus would need to be moved. Therefore the
string hash table remains global, and lock congestion is avoided with
some concurrent access hash table implementation. See issue "Lock-free
Lock-free is a good start, but the hash function must also provide a
good even distribution to avoid hotspots. Pike currently uses an
in-house algorithm (DO_HASHMEM in pike_memory.h). Replacing it with a
more widespread and better studied alternative should be considered.
There seems to be few that are below O(n) (which DO_HASHMEM is),
Like strings, types are globally unique and always shared in Pike.
That means lock-free access to them is desirable, and it should also
be doable fairly easily since they are constant. Otoh it's probably
not as vital as for strings since types typically only are built
Issue: Shared mapping and multiset data blocks
An interesting issue is if things like mapping/multiset data blocks
should be first or second class things (c.f. issue "Memory object
structure"). If they're second class it means copy-on-write behavior
doesn't work across lock spaces. If they're first class it means
additional overhead handling the lock spaces of the mapping data
blocks, and if a mapping data is shared between lock spaces then it
has to be in some third lock space of its own, or in the global lock
space, neither of which would be very good.
So it doesn't look like there's a better way than to botch
copy-on-write in this case.
Issue: Emulating the interpreter lock
For compatibility with old C modules, and for the _disable_threads
function, it is necessary to retain a complete lock like the current
interpretator lock. It has to lock the global area for writing, and
also stop all access to all lock spaces, since the thread local data
might refer to any lock space.
This lock is implemented as a read/write lock, which normally is held
permanently for reading by all threads. Only when a thread is waiting
to acquire the compat interpreter lock is it released as each thread
goes into check_threads().
This lock cannot wait for explicit lock space locks to be released.
Thus it can override the assumption that a lock space is safe from
tampering by holding a write lock on it. Still, it's only available
from the C level (with the exception of _disable_threads) so the
situation is not any different from the way the interpreter lock
overrides Thread.Mutex today.
Issue: Function calls
A lock on an object is almost always necessary before calling a
function in it. Therefore the central apply function (mega_apply) must
ensure an appropriate lock is taken. Which kind of lock
(read-safe/read-constant/write - see issue "Lock space lock
semantics") depends on what the function wants to do. Therefore all
object functions are extended with flags for this.
The best default is probably read-safe. Flags for no locking (for the
few special cases where the implementations actually are completely
lock-free) and for compat-interpreter-lock-locking should probably
exist as well. A compat-interpreter-lock flag is also necessary for
global functions that don't have a "this" object (aka efuns).
Having the required locking declared this way also alleviates each
function from the burden of doing the locking to access the current
storage, and it allows future compiler optimizations to minimize lock
"Forgotten" locks after exceptions shouldn't be a problem: Explicit
locks are handled just like today (i.e. it's up to the pike
programmer), and implicit locks can safely be released when an
exception is thrown.
One case requires attention: An old-style function that requires the
compat interpreter lock might catch an error. In that case the error
system has to ensure that lock is reacquired.
Issue: C module interface
A new add_function variant is probably added for new-style functions.
It takes bits for the flags discussed for issue "Function calls".
New-style functions can only assume free access to the current storage
according to those flags; everything else must be locked (through a
new set of macros/functions).
Accessor functions for data types (e.g. add_shared_strings,
mapping_lookup, and object_index_no_free) handles the necessary
locking internally. They will only assume that the thing is safe, i.e.
that the caller ensures the current thread controls at least one ref.
THREADS_ALLOW/THREADS_DISALLOW and their likes are not used in
There will be new GC callbacks for walking module global pointers to
things (see issue "Garbage collection and external references").
Issue: C module compatibility
Ref issue "Emulating the interpreter lock".
Ref issue "Garbage collection and external references".
Issue: Garbage collection and external references
The current gc design is that there is an initial "check" pass that
determines external references by counting all internal references,
and then for each thing subtract it from its refcount. If the result
isn't zero then there are external references (e.g. from global C
variables or from the C stack) and the thing is not garbage.
The new gc (c.f. issue "Garbage collector") does not refcount external
refs and refs from the C or Pike stacks. It needs to find them some
References from global C variables are few, so they can be dealt with
by requiring C modules and the core parts to provide callbacks that
lets the gc walk through them (see issue "C module interface"). This
is however not compatible with old C modules.
References from C stacks are common, and it is infeasible to require
callbacks that keep track of them. The gc instead has to scan the C
stacks for the threads and treat any aligned machine word containing
an apparently valid pointer to a gc candidate thing as an external
reference. This is the common approach used by standalone gc libraries
that don't require application support. For reference, here is one
such garbage collector, written in C++:
Its source is here:
The same approach is also necessary to cope with old C modules (see
issue "C module compatibility"), but since global C level pointers are
few, it might not be mandatory to get this working.
Issue: Global pike level caches
A lock-free implementation should be used. The things in the queue are
typically disowned to allow them to become thread local in the reading
Issue: False sharing
False sharing occurs when thread local things used frequently by
different threads are next to each other so that they share the same
cache line. Thus the cpu caches might force frequent resynchronization
of the cache line even though there is no apparent hotspot problem on
the C level.
This can be a problem in particular for all the block_alloc pools
containing small structs. Using thread local pools is seldom a
workable solution since most thread local structs might become shared
One way to avoid it is to add padding (and alignment). Cache line
sizes are usually 64 bytes or less (at least for Intel ia32). That
should be small enough to make this viable in many cases.
FIXME: Check cache line sizes on the other important architectures.
Another way is to move things when they get shared, but that is pretty
complicated and slow.
Issue: Malloc and block_alloc
Standard OS mallocs are usually locking. Bundling a lock-free one
could be important. FIXME: Survey free implementations.
Block_alloc is a simple homebrew memory manager used in several
different places to allocate fixed-size blocks. The block_alloc pools
are often shared, so they must allow efficient concurrent access. With
a modern malloc, it is possible that the need for block_alloc is gone,
or perhaps the malloc lib has builtin support for fixed-size pools.
Making a lock-free implementation is nontrivial, so the homebrew ought
to be ditched in any case.
A problem with ditching block_alloc is that there is some code that
walks through all allocated blocks in a pool, and also avoids garbage
by freeing the whole pool altogether. FIXME: Investigate alternatives
See also issue "False sharing".
Issue: Heap size control
There should be better tools to control the heap size. It should be
possible to set the wanted heap size so that the gc runs timely before
that limit is reached. Pike should detect the available amount of real
memory (i.e. not counting swap) to use as default. The gc should still
use a garbage projection strategy to keep the process below the
configured maximum size for as long as possible. This is more
important if the gc is used also for previously refcounted garbage
(c.f. issue "Garbage collector").
Malloc calls should be wrapped to allow the gc to run in blocking mode
in case they fail.
Issue: The compiler
Issue: Foreign thread visits
Issue: Pike security system
It is possible that keeping the pike security system intact would
complicate the implementation, and even if it was kept intact a lot of
testing would be required before one can be confident that it really
works (and there are currently very few tests for it in the test
Also, the security system isn't used at all to my (mast's) knowledge,
and it is not even compiled in by default (has to be enabled with a
All this leads to the conclusion that it is easiest to ignore the
security system altogether, and if possible leave it as it is with the
option to get it working later.
Issue: Contention-free counters
There is probably a need for contention-free counters in several
different areas. They should be possible to update from several
threads in parallell without synchronization. Querying the current
count is always approximate since it can be changing simultaneously in
other threads. However, the thread's own local count is always
They should be separated from the blocks they apply to, to avoid cache
line invalidation of those blocks.
To accomplish that, a generic tool somewhat similar to block_alloc is
created that allocates one or more counter blocks for each thread. In
these blocks indexes are allocated, so a counter is defined by the
same index into all the thread local counter blocks.
Each thread can then modify its own counters without locking, and it
typically has its own counter blocks in the local cache while the
corresponding main memory is marked invalid. To query a counter, a
thread would need to read the blocks for all other threads.
This means that these counters are efficient for updates but less so
for queries. However, since queries always are approximate, it is
possible to cache them for some time (e.g. 1 ms). Each thread would
need its own cache though, since the local count cannot be cached.
It should be lock-free for allocating and freeing counters, and
preferably also for starting and stopping threads (c.f. issue "Foreign
thread visits"). In both cases the freeing steps represents a race
problem - see issue "Hazard pointers". To free counters, the counter
index would constitute the hazard pointer.
Issue: Lock-free hash table
A good lock-free hash table implementation is necessary. A promising
one is http://blogs.azulsystems.com/cliff/2007/03/a_nonblocking_h.html.
It requires a CAS (Compare And Swap) instruction to work, but that
shouldn't be a problem. The java implementation
(http://sourceforge.net/projects/high-scale-lib) is Public Domain. In
the comments there is talk about efforts to make a C version.
It supports (through putIfAbsent) the uniqueness requirement for
strings, i.e. if several threads try to add the same string (at
different addresses) then all will end up with the same string pointer
The java implementation relies on the gc to free up the old hash
tables after resize. We don't have that convenience, but the problem
is still solvable; see issue "Hazard pointers".
Issue: Hazard pointers
A problem with most lock-free algorithms is how to know no other
thread is accessing a block that is about to be freed. Another is the
ABA problem which can occur when a block is freed and immediately
allocated again (common for block_alloc).
Hazard pointers are a good way to solve these problems without leaving
the blocks to the garbage collector (see
http://www.research.ibm.com/people/m/michael/ieeetpds-2004.pdf). So a
generic hazard pointer tool might be necessary for blocks not known to
Note however that a more difficult variant of the ABA problem still
can occur when the block cannot be freed after leaving the data
structure. (In the canonical example with a lock-free stack - see e.g.
"ABA problem" in Wikipedia - consider the case when A is a thing that
continues to live on and actually gets pushed back.) The only reliable
way to cope with that is probably to use wrappers.
Issue: Thread local storage
Implementation would be considerably simpler if working TLS can be
assumed on the C level, through the __thread keyword (or
__declspec(thread) in Visual C++). A survey of the support for TLS in
common compilers and OS'es is needed to decide whether this is an
o GCC: __thread is supported. Source: Wikipedia.
FIXME: Check from which version.
o Visual C++: __declspec(thread) is supported. Source: Wikipedia.
FIXME: Check from which version.
o Intel C compiler: Support exists. Source: Wikipedia.
FIXME: Check from which version.
o Sun C compiler: Support exists. Source: Wikipedia.
FIXME: Check from which version.
o Linux (i386, x86_64, sparc32, sparc64): TLS is supported and works
for dynamic libs. C.f. http://people.redhat.com/drepper/tls.pdf.
FIXME: Check from which version of glibc and kernel (if relevant).
o Windows (i386, x86_64): TLS is supported but does not always work
in dll's loaded using LoadLibrary (which means all dynamic modules
in pike). C.f. http://msdn.microsoft.com/en-us/library/2s9wt68x.aspx.
According to Wikipedia this is fixed in Vista and Server 2008
(FIXME: verify). In any case, TLS is still usable in the pike core.
o MacOS X: FIXME: Check this.
o Solaris: FIXME: Check this.
o *BSD: FIXME: Check this.
Issue: Platform specific primitives
Some low-level primitives, such as CAS and fences, are necessary to
build the various lock-free tools. A third-party library would be
An effort to make a standardized library is here:
level interface at the end). It apparently lacks implementation,
CAS(address, old_value, new_value)
Compare-and-set: Atomically sets *address to new_value iff its
current value is old_value. Needed for 32-bit variables, and on
64-bit systems also for 64-bit variables.
Increments/decrements *address atomically. Can be simulated with
CAS. 32-bit version necessary, 64-bit version would be nice.
Load fence: All memory reads in the thread before this point are
guaranteed to be done (i.e. be globally visible) before any
Store fence: All memory writes in the thread before this point are
guaranteed to be done before any following it.
Memory fence: Both load and store fence at the same time. (On many
architectures this is implied by CAS etc, but we shouldn't assume
The following operations are uncertain - still not known if they're
useful and supported enough to be required, or if it's better to do
CASW(address, old_value_low, old_value_high, new_value_low, new_value_high)
A compare-and-set that works on a double pointer size area.
Supported on more modern x86 and x86_64 processors (c.f.
Survey of platform support:
o Windows/Visual Studio: Got "Interlocked Variable Access":
o FIXME: More..
OpenMP (see www.openmp.org) is a system to parallelize code using
pragmas that are inserted into the code blocks. It can be used to
easily parallelize otherwise serial internal algorithms like searching
and all sorts of loops over arrays etc. Thus it addresses a different
problem than the high-level parallelizing architecture above, but it
might provide significant improvements nevertheless.
It's therefore worthwhile to look into how this can be deployed in the
Pike sources. If support is widespread enough, it could be considered
to even make it a requirement to be able to deploy the builtin tools
for atomicity and ordering (provided they are useful outside the omp
Compiler support (taken from www.openmp.org):
o gcc since 4.3.2.
o Microsoft Visual Studio 2008 or later.
o Sun compiler (starting version unknown).
o Intel compiler since 10.1.
o ..and some more.
FIXME: Survey platform-specific limitations.
Pragmatic nonblocking synchronization for real-time systems
DCAS is not a silver bullet for nonblocking algorithm design