Multi-cpu support in Pike |
------------------------- |
|
This is a draft spec for how to implement multi-cpu support in Pike. |
The intention is that it gets extended along the way as more issues
get ironed out. Discussions take place in "Pike dev" in LysKOM or
pike-devel@lists.lysator.liu.se. |
|
Initial draft created 8 Nov 2008 by Martin Stjernholm. |
|
|
Background and goals |
|
Pike supports multiple threads, but like many other high-level |
languages it only allows one thread at a time to access the data |
structures. This means that the utilization of multi-cpu and |
multi-core systems remains low, even though there are some modules |
that can do isolated computational tasks in parallel (e.g. the Image
module). |
|
It is the so-called "interpreter lock" that must be locked to access |
any reference variable (i.e. everything except floats and native |
integers). This lock is held by default in essentially all C code and |
is explicitly unlocked in a region by the THREADS_ALLOW/ |
THREADS_DISALLOW macros. On the pike level, the lock is always held - |
no pike variable can be accessed and no pike function can be called |
otherwise. |
|
The purpose of the multi-cpu support is to rectify this. The design |
goals are, in order of importance: |
|
1. Pike threads should be able to execute pike code concurrently on |
multiple cpus as long as they only modify thread local pike data |
and read a shared pool of static data (i.e. the pike programs, |
modules and constants). |
|
2. There should be as few internal hot spots as possible (preferably |
none) when pike code is executed concurrently. Care must be taken |
to avoid internal synchronization, or updates of shared data that |
would cause "cache line ping-pong" between cpus. |
|
3. The concurrency should be transparent on the pike level. Pike code |
should still be able to access shared data without locking and |
without risking low-level inconsistencies. (So Thread.Mutex etc |
would still be necessary to achieve higher level synchronization.) |
|
4. Current pike code should continue to run without compatibility |
problems, and also keep the same time and space complexity |
characteristics in the uniprocessor or locked case. This includes |
   important optimizations that define de-facto behavior which much
pike code takes into account. |
|
5. There should be tools on the pike level to allow further |
performance tuning, e.g. lock-free queues, concurrent access hash |
tables, and the possibility to lock different regions of shared |
data separately. These tools should be designed so that they are |
easy to slot into existing code with few changes. |
|
6. There should be tools to monitor and debug concurrency. It should |
be possible to make assertions that certain objects aren't shared, |
and that certain access patterns don't cause thread |
synchronization. This is especially important if goal (3) is |
realized, since the pike code by itself won't show what is shared |
and what is thread local. |
|
7. C modules should continue to work without source level |
modification (but likely without allowing any kind of |
concurrency). |
|
Note that even if goal (3) is accomplished, this is no miracle cure |
that would make all multithreaded pike programs run with optimal |
efficiency on multiple cpus. One could expect better concurrency in |
old code without adaptations, but it could still be hampered
considerably by e.g. frequent updates to shared data. Concurrency is a |
problem that must be taken into account on all levels. |
|
|
Other languages |
|
Perl: All data is thread local by default. Data can be explicitly |
shared, in which case Perl ensures internal consistency. Every shared |
variable is apparently locked individually. Referencing a thread local |
variable from a shared one causes the thread to die. See perlthrtut(1).
|
Python: Afaik it's the same state of affairs as Pike. |
|
|
Solution overview |
|
The basic approach is to divide all data into thread local and shared: |
|
o Thread local data is everything that is accessible to one thread |
only, i.e. there are no references to anything in it from shared |
data or from any other thread. This is typically data that the |
  current thread has created itself and only references from the
stack. The thread can access its local data without locking. |
|
o Shared data is everything that is accessible from more than one |
thread. Access to it is synchronized using a global read/write |
lock, the so-called "global lock". I.e. this lock can either be |
locked for reading by many threads, or be locked by a single thread |
for writing. Locking the global lock for writing is the same as |
locking the interpreter lock in current pikes. (This single lock is |
refined later - see issue "Lock spaces".) |
|
o There is also a special case where data can be "disowned", i.e. not |
shared and not local in any thread. This is used in e.g. |
Thread.Queue for the objects that are in transit between threads. |
Disowned data cannot have arbitrary references to it - it must |
always be under the control of some object that in some way ensures |
consistency. (Garbage could be made disowned since it by definition |
no longer is accessible from anywhere, but of course it is always |
better to clean it up instead.) |
|
+--------+ +---------------------+ Direct +--------+ |
| |<-- refs --| Thread 1 local data |<- - access - -| | |
| | +---------------------+ | Thread | |
| | | 1 | |
| |<- - - - Access through global lock only - - - -| | |
| Shared | +--------+ |
| | |
| data | +---------------------+ Direct +--------+ |
| |<-- refs --| Thread 2 local data |<- - access - -| | |
| | +---------------------+ | Thread | |
| | | 2 | |
| |<- - - - Access through global lock only - - - -| | |
| | +--------+ |
+--------+ ... etc ... |
|
The principal use case for this model is that threads can do most of |
their work with local data and read access to the shared data, and |
comparatively seldom require the global write lock to update the |
shared data. Shared things do not each have their own lock, since
that would cause excessive lock overhead.
|
Note that the shared data is typically the same as the data referenced |
from the common environment (i.e. the "global data"). |
|
Also note that the current object (this) always is shared in pike |
modules, so a thread cannot assume free access to it. In other pike |
classes it would often be shared too, but it is still important to |
utilize the situation when it is thread local. See issue "Function |
calls". |
|
A thread local thing, and all the things it references directly or |
indirectly, automatically becomes shared whenever it gets referenced |
from a shared thing. |
|
A shared thing never automatically becomes thread local, but there is |
a function to explicitly "take" it. It would first have to make sure |
there are no references to it from shared or other thread local things |
(c.f. issue "Moving things between lock spaces"). Thread.Queue has a |
special case so that if a thread local thing with no other refs is |
enqueued, it is disowned by the current thread, and later becomes |
thread local in the thread that dequeues it. |
|
|
Issue: Lock spaces |
|
Having a single global read/write lock for all shared data could |
become a bottleneck. Thus there is a need for shared data with locks |
separate from the global lock. Things that share a common lock are
called a "lock space", and it is always possible to look up the lock |
that governs any given thing (see issue "Memory object structure"). |
|
A special global lock space, which corresponds to the shared data |
discussed above, is created on startup. All others have to be created |
explicitly. |
|
The intended use case for lock spaces is a "moderately large" |
collection of things: Too large and you get outlocking problems, too |
small and the lock overhead (both execution- and memorywise) gets
prohibitive. A typical lock space could be a RAM cache consisting of a
mapping and all its content. |
|
Many different varieties of lock space locks can be considered, e.g. a |
simple exclusive access mutex lock or a read/write lock, priority |
locks, locks that ensure fairness, etc. Therefore different (C-level) |
implementations should be allowed. |
|
One important characteristic of lock space locks is whether they are |
implicit or explicit: |
|
Implicit locks are locked internally, without intervention on the pike |
level. The lock duration is unspecified; locks are only acquired to |
ensure internal consistency. All low level data access functions check |
whether the lock space for the accessed thing is locked already. If it |
isn't then the lock is acquired automatically. All implicit locks have |
a well defined lock order (by pointer comparison), and since they only |
are taken to guarantee internal consistency, an access function can |
always release a lock to ensure correct order (see also issue "Lock |
space locking"). |
|
Explicit locks are exposed to the pike level and must be locked in a |
similar way to Thread.Mutex. If a low level data access function |
encounters an explicit lock that isn't locked, it throws an error. |
Thus it is left to the pike programmer to avoid deadlocks, but the |
pike core won't cause any by itself. Since the pike core keeps track
of which lock governs which thing, it ensures that no lock violating
access occurs, which is a valuable aid to ensure correctness. |
|
One can also consider a variant with a read/write lock space lock that |
is implicit for read but explicit for write, thus combining atomic |
pike-level updates with the convenience of implicit locking for read |
access. |
|
The scope of a lock space lock is (at least) the state inside all the |
things it contains (with a couple of exceptions - see issue "Lock space
lock semantics"), but not the set of things itself, i.e. things might |
be added to a lock space without holding a write lock. Removing a |
thing from a lock space always requires the write lock on it since |
that is necessary to ensure that a lock actually governs a thing for |
as long as it is held (regardless of whether it's for reading or
writing).
|
See also issues "Memory object structure" and "Lock space locking" for |
more details. |
|
|
Issue: Memory object structure |
|
Of concern are the memory objects known to the gc. They are called |
"things", to avoid confusion with "objects" which are the structs for |
pike objects. |
|
There are two types of things: |
|
o First class things with gc header and lock space pointer. Most pike |
visible types are first class things. The exceptions are ints and |
floats, which are passed by value. |
|
o Second class things contain only a gc header. They are similar to |
first class except that their lock spaces are implicit from the |
referencing things, which means all those referencing things must |
always be in the same lock space. |
|
Thread local things could have NULL as lock space pointer, but as a |
debug measure they could also point to the thread object so that it's |
possible to detect bugs with a thread accessing things local to |
another thread. |
|
Before the multi-cpu architecture, there are global doubly-linked
lists for each referenced pike type: array, mapping, multiset, object,
and program (strings and types are handled differently). Thanks to the
new gc, the doubly-linked lists aren't needed at all anymore.
|
+----------+ +----------+ |
| Thread 1 | | Thread 2 | |
.+----------+. .+----------+. |
: refs O : : O O : |
,----- O <--> O : ,------- O O ------. |
| : O O -----. | : O O : | |
| :............: | | :............: | |
ref | | ref | ref | ref |
| | | | |
.|.............. ..v.......v..... refs ..............|. |
: | refs : ref : O O O <------> O O v : |
: v O <---> O ------------> O O : : O O : |
: O O O O : : O O O : : O O O : |
+--------------+ +--------------+ +--------------+ |
| Lock space 1 | | Lock space 2 | | Lock space 3 | |
+--------------+ +--------------+ +--------------+ |
|
This figure tries to show some threads and lock spaces, and their |
associated things as O's inside the dotted areas. Some examples of |
possible references between things are included: Thread local things |
can only reference things belonging to the same thread or things in |
any lock space, while things in lock spaces can reference things in |
the same or other lock spaces. There can be cyclic structures that |
span lock spaces. |
|
The lock space lock structs are tracked by the gc just like anything |
else, and they are therefore garbage collected when they become empty |
and unreferenced. The gc won't free a lock space lock struct that is |
locked since it always has at least one reference from the array of
locked locks that each thread maintains (c.f. issue "Lock space |
locking"). |
|
|
Issue: Lock space lock semantics |
|
There are three types of locks: |
|
o A read-safe lock ensures only that the data is consistent, not that |
it stays constant. This allows lock-free updates in things where |
possible (which could include arrays, mappings, and maybe even |
multisets and objects of selected classes). |
|
o A read-constant lock ensures both consistency and constantness |
(i.e. what usually is assumed for a read-only lock). |
|
o A write lock ensures complete exclusive access. The owning thread |
can modify the data, and it can assume no other changes occur to it |
(barring refcounters and lock space pointers - see below), although |
that assumption has to be "weak" since there are a few situations |
when another thread can intervene - see issue "Emulating the |
interpreter lock". |
|
  The owning thread can also, for a limited time, leave the data in
  an inconsistent state. This is however still limited by the calls to
check_threads(), which means that the state must be consistent |
again every time the evaluator callbacks are run. The reason is the |
same one as above. |
|
Allowing lock-free updates is attractive, so the standard read/write |
lock that governs the global lock space will probably be multiple |
read-safe/single write. |
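
For concreteness, a minimal C sketch of how these three lock types
might be represented at the C level (the names are hypothetical):

  /* Hypothetical constants for the three lock space lock types
   * described above. */
  enum lock_type {
    LOCK_READ_SAFE,     /* data stays consistent but may change */
    LOCK_READ_CONSTANT, /* data stays consistent and constant */
    LOCK_WRITE          /* exclusive access; holder may modify */
  };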
|
The lock space lock covers all the data in the thing, with two |
exceptions: |
|
o The refcounter (and other gc-related flags and fields) can always |
change concurrently since the gc runs in a thread of its own, and |
it doesn't heed any locks - see issue "Garbage collector". |
|
A ref to a thing can always be added or removed, even if another |
  thread holds an exclusive write lock on it. That is because the thing
will only be freed by the gc, which won't free it if a ref is |
added. |
|
Refcount updates need to be atomic if the refcounts are to be used |
at all from other threads. Even so, they can only be used |
opportunistically since they (almost) always might change |
asynchronously. That could still be good enough for e.g. |
  Pike.count_memory (no one could expect it to be accurate anyway if
another thread is modifying the data structure being measured). |
|
o The lock space pointer itself must at all times be either NULL or |
  point to a valid lock space struct, since another thread needs to
access it to tell whether access to the thing is permissible. A |
write lock is required to change the lock space pointer, but even |
so the update must be atomic. |
|
Since the lock space lock structs are collected by the gc, there is |
  no risk of races when threads asynchronously dereference lock
space pointers. |
|
FIXME: What about concurrent gc access to follow pointers? |
|
|
Issue: Lock space locking |
|
This is the locking procedure to access a thing: |
|
1. Read the lock space pointer. If it's NULL then the thing is thread |
local and nothing more needs to be done. |
2. Address an array containing the pointers to the lock spaces that |
are already locked by the thread. |
3. Search for the lock space pointer in the array. If present then |
nothing more needs to be done. |
4. Lock the lock space lock as appropriate. Note that this can imply
   that other implicit locks that are held are released to ensure
   correct lock order (see issue "Lock spaces"). The lock space is
   then added to the array.
|
A thread typically won't hold more than a few locks at any time (less |
than ten or so), so a plain array and linear search should perform |
well. For quickest possible access the array should be a static thread |
local variable (c.f. issue "Thread local storage"). If the array gets |
full, implicit locks in it can be released automatically to make |
space. Still, a system where more arrays can be allocated and chained |
on would perhaps be prudent to avoid the theoretical possibility of |
running out of space for locked locks. |
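
A minimal C sketch of this procedure, using the lock_type constants
sketched under issue "Lock space lock semantics". All other names
(thread_state, lock_space_lock, release_implicit_locks) are
hypothetical, since the real data structures aren't specified by this
draft:

  #define MAX_HELD_LOCKS 16  /* plain array; linear search is adequate */

  struct thread_state {
    struct lock_space *held[MAX_HELD_LOCKS]; /* locks held by this thread */
    int num_held;
  };

  /* Ensure the lock space governing `thing' is locked before access.
   * A NULL lock space pointer means the thing is thread local. */
  static void lock_for_access(struct thread_state *ts,
                              struct thing *thing, enum lock_type type)
  {
    struct lock_space *ls = thing->lock_space;      /* step 1 */
    if (!ls) return;               /* thread local: no locking needed */

    for (int i = 0; i < ts->num_held; i++)          /* steps 2 and 3 */
      if (ts->held[i] == ls) return;                /* already locked */

    /* Step 4: take the lock. Implicit locks already held may be
     * released here to maintain the pointer-order lock order. */
    if (ts->num_held == MAX_HELD_LOCKS)
      release_implicit_locks(ts);  /* hypothetical: makes room */
    lock_space_lock(ls, type);     /* hypothetical primitive */
    ts->held[ts->num_held++] = ls;
  }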
|
Since implicit locks can be released (almost) at will, they are open |
for performance tuning: Too long lock durations and they'll outlock |
other threads, too short and the locking overhead becomes more |
significant. As a starting point, it seems reasonable to release them |
at every evaluator callback call (i.e. at approximately every pike |
function call and return). |
|
|
Issue: Garbage collector |
|
Pike has used refcounting to collect noncyclic structures, combined |
with a stop-the-world periodical collector for cyclic structures. The |
periodic pauses are already a problem, and it only gets worse as the |
heap size and number of concurrent threads increase. Since the gc |
needs an overhaul anyway, it makes sense to replace it with a more |
modern solution. |
|
http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-get.cgi/2006/PHD/PHD-2006-10.ps |
is a recent thesis work that combines several state-of-the-art gc |
algorithms into an efficient whole. A brief overview of the highlights:
|
o The reference counts aren't updated for references on the stack. |
The stacks are scanned when the gc runs instead. This saves a great |
deal of refcount updates, and it also simplifies C level |
programming a lot. Only refcounts between things on the heap are |
counted. |
|
o The refcounts are only updated when the gc runs. This saves a lot |
of the remaining updates since if a pointer starts with value p_0 |
and then changes to p_1, then p_2, p_3, ..., and lastly to p_n at |
the next gc, then only p_0->refs needs to be decremented and |
p_n->refs needs to be incremented - the changes in all the other |
refcounts, for the things pointed to in between, cancel out. |
|
o The above is accomplished by thread local logging, to make the old |
p_0 value available to the gc at the next run. This means it scales |
well with many cpu's. |
|
o A generational gc uses refcounting only for old things in the heap. |
New things, which are typically very short-lived, aren't refcounted |
at all but instead gc'ed using a mark-and-sweep collector. This is |
shown to be more efficient for short-lived data, and it handles |
cyclic structures without any extra effort. |
|
o By using refcounting on old data, the gc only needs to give
  attention to refcounts that get down to zero. This means the heap
can scale to any size without affecting the gc run time, as opposed |
to using a mark-and-sweep collector on the whole heap. Thus the gc |
time scales only with the amount of _change_ in the heap. |
|
o Cyclic structures in the old refcounted data are handled
incrementally using the fact that a cyclic structure can only occur |
when a refcounter is decremented to a value greater than zero. |
Those things can therefore be tracked and cycle checked in the |
background. The gc uses several different methods to weed out false |
alarms before doing actual cycle checks. |
|
o The gc runs entirely in its own thread. It only needs to stop the |
working threads for a very short time to scan stacks etc, and they |
can be stopped one at a time. |
|
Effects of using this in Pike: |
|
a. References from the C or pike stacks don't need any handling at |
all (see also issue "Garbage collection and external references"). |
|
b. A significant complication in various lock-free algorithms is the |
safe freeing of old blocks (see e.g. issue "Lock-free hash |
table"). This gc would solve almost all such problems in a |
convenient way. |
|
c. Special code is used to update refs in the heap. During certain |
circumstances, before changing a pointer inside a thing which can |
point to another thing, the state of all non-NULL pointers in it |
   is copied to a thread local log.
|
This is mostly problematic since it requires that every pointer |
assignment inside a thing is replaced with a macro or function |
call, which has a big impact on C code. See issue "C module |
interface". |
|
d. A new log_pointer field is required per thing. If a state copy has |
taken place as described above, it points to the log that contains |
the original pointer state of the thing. |
|
Data containers that can be of arbitrary size (i.e. arrays, |
mappings and multisets) should be segmented into fixed-sized |
chunks with one log_pointer each, so that the state copy doesn't |
get arbitrarily large. |
|
e. The doubly-linked lists aren't needed. Hence two fewer pointers per
thing. |
|
f. The refcounter word is changed to hold both normal refcount, weak |
count, and flags. Overflowed counts are stored in a separate hash |
table. |
|
g. The collector typically runs concurrently with the rest of the |
program, but there are some situations when it has to synchronize |
with them (aka handshake). In the research paper this is done by |
letting the gc thread suspend and resume the other threads (one at |
a time). Since preemptive suspend and resume operations are |
generally unsupported in thread libraries (c.f. issue "Preemptive |
thread suspension"), a cooperative approach is necessary: |
|
   The gc thread sets a state flag indicating that all other threads
   need to do a handshake. Threads that are running do the handshake
   work themselves before waiting on a mutex or in the next evaluator
callback call, and the gc thread handles the threads that are |
currently waiting (ensuring that they don't start in the |
meantime). |
|
The work that needs to be done during a handshake is to set some |
flags and record some local thread state for use by the gc thread. |
   This can be done concurrently in several threads, so no locking is
   necessary. (A sketch of this cooperative handshake follows after
   this list.)
|
Due to this interaction with the other threads, it's vital that |
the gc thread does not hold any mutex, and that it takes care to |
avoid being stopped (e.g. through an interrupt) while it works on |
   behalf of another thread.
|
h. All garbage, both noncyclic and cyclic, is discovered and handled
   by the gc thread. The other threads never free any block known to
   the gc.
|
i. An effect of the above is that all garbage is discovered by a |
separate collector thread which doesn't execute any other pike |
   code. This opens up the issue of how to call destruct functions.
|
At least thread local things should reasonably get their destruct |
calls in that thread. A problem is however what to do when that |
thread has exited or emigrated (see issue "Foreign thread |
visits"). |
|
For shared things it's not clear which thread should call destruct |
anyway, so in that case any thread could do it. It might however |
be a good idea to not do it directly in the gc thread, since doing |
so would require that thread too to be a proper pike thread with |
pike stack etc; it seems better to keep it an "invisible" |
low-level thread outside the "worker" threads. In programs with a |
"backend thread" it could be useful to allow the gc thread wake up |
the backend thread to let it execute the destruct calls. |
|
j. The most bothersome problem is that things are no longer freed |
right away when running out of refs. See issue "Immediate |
destruct/free when refcount reaches zero". |
|
k. Weak refs are handled with a separate refcount in each thing. That |
means things have two refcounts: One for weak refs and another for |
all refs. See also issue "Weak ref garbage collection". |
|
l. One might consider separating the refcounts from the things by |
using a hash table. This makes sense when considering that only |
   the collector thread is using the refcounts, thereby avoiding
   false sharing caused by refcounter updates (and updates of other
   gc related flags) by that thread.
|
All the hash table lookups would however incur a significant |
overhead in the gc thread. A better alternative would be to use a |
bitmap based on the possible allocation slots used by the malloc |
implementation, but that would require very tight integration with |
the malloc system. The bitmap could work with only two bits per |
refcounter - research shows that most objects in a refcounted heap |
have very few refs. Overflowing (a.k.a. "stuck") refcounters at 3 |
would then be stored in a hash table. |
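
As a concrete illustration of the cooperative handshake in item (g),
here is a rough C sketch using C11 atomics. All names are
hypothetical, and the handling of blocked threads is only hinted at
in comments:

  #include <sched.h>
  #include <stdatomic.h>

  static atomic_bool gc_handshake_requested;
  static atomic_int  gc_handshake_acks;

  /* Called by worker threads in every evaluator callback and just
   * before waiting on a mutex. */
  void gc_check_handshake(struct thread_state *ts)
  {
    if (atomic_load_explicit(&gc_handshake_requested,
                             memory_order_acquire)) {
      gc_record_thread_state(ts);  /* hypothetical: set flags and
                                    * snapshot local thread state */
      atomic_fetch_add(&gc_handshake_acks, 1);
    }
  }

  /* Called by the gc thread, which must not hold any mutex here. */
  void gc_run_handshake(int num_running_threads)
  {
    atomic_store(&gc_handshake_acks, 0);
    atomic_store_explicit(&gc_handshake_requested, 1,
                          memory_order_release);
    gc_record_waiting_threads(); /* hypothetical: handle threads that
                                  * are blocked, ensuring they don't
                                  * resume in the meantime */
    while (atomic_load(&gc_handshake_acks) < num_running_threads)
      sched_yield();             /* wait for the running threads */
    atomic_store(&gc_handshake_requested, 0);
  }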
|
To simplify memory handling, the gc should be used consistently on all |
heap structs, regardless of whether they are pike visible things or not.
An interesting question is whether the type info for every struct |
(more concretely, the address of some area where the gc can find the |
functions it needs to handle the struct) is carried in the struct |
itself (through a new pointer field), or if it continues to be carried |
in the context for every pointer to the struct (e.g. in the type field |
in svalues). |
|
Since the gc would be used for most internal structs as well, which |
are almost exclusively used via compile-time typed pointers, it would |
probably save significant heap space to retain the type in the pointer |
context. It does otoh complicate the gc - everywhere where the gc is |
fed a pointer to a thing, it must also be fed a type info pointer, and |
the gc must then keep track of this data tuple internally. |
|
|
Issue: Immediate destruct/free when refcount reaches zero |
|
When a thing in Pike runs out of references, it's destructed and freed |
almost immediately in the pre-multi-cpu implementation. This behavior |
in Pike is used implicitly in many places. The major (hopefully all) |
principal use cases of concern are: |
|
1. It's popular to make code that releases a lock in a timely manner
   by just storing it in a local variable that gets freed when the
   function exits (either by normal return or by exception). E.g:
|
void foo() { |
Thread.MutexKey my_lock = my_mutex->lock(); |
... do some work ... |
// my_lock falls out of scope here when the function exits |
// (also if it's due to a thrown exception), so the lock is |
// released right away. |
} |
|
There's also code that opens files and sockets etc, and expects |
them to be automatically closed again through this method. (That |
practice has been shown to be bug prone, though, so in the sources |
at Roxen many of those places have been fixed over time.) |
|
2. In some cases, structures are carefully kept acyclic to make them |
   get freed quickly, and there is no control over which party gets
   the "last reference".
|
One example is if a cache holds one ref to an entry, and there |
might at the same time be one or more worker threads that hold |
references to the same entry while they use it. In this case the |
cache can be pruned safely by dropping the reference to the entry, |
without destructing it. |
|
A variant when the structure cannot be made acyclic is to make a |
"wrapper object": It holds a reference to the cyclic structure, |
   and all other parties make sure to hold a ref to the wrapper as
   long as they have an interest in any part of the data. When the
wrapper runs out of refs, it destructs the cyclic structure |
explicitly. |
|
   These tricks have mostly been used to reduce the amount of cyclic
   garbage that would otherwise require the stop-the-world gc to run
   more often, but there are also occasions when the structure holds
   open fd's which
must be closed without delay (one such occasion is the connection |
fd in the http protocol in the Roxen WebServer). |
|
3. In some applications with extremely high data mutation rate, the |
immediate freeing of acyclic structures is seen as a prerequisite |
to keep bounds on memory consumption. |
|
4. FIXME: Are there more? |
|
The proposed gc (c.f. issue "Garbage collector") does not retain the |
immediate destruct and free semantic - only the gc running in its own |
thread may free things. Although it would run much more often than the |
old gc (probably on the order of once a minute up to several times a |
second), it would still break this semantic. To discuss each use case |
above: |
|
1. Locks, and in some cases also open fd's, cannot wait until the |
next gc run. |
|
Observing that mutex locks always are thread local things, almost |
all these cases (exceptions are possibly fd objects that somehow |
are shared anyway) can be solved by a modified gc approach - see |
issue "Micro-gc". |
|
   Since the micro-gc approach appears to be expensive, it's worth
   considering actually ditching this behavior and solving the problem
   on the pike level instead. The compiler can be used to detect many
of these cases by looking for assignments to local variables that |
aren't accessed from anywhere (there is already such a warning, |
but it has been tuned down just to allow this problematic idiom). |
|
A new language construct would be necessary, to ensure that the |
variable gets destructed both on normal function exit and when an |
exception is thrown. It could look something like this: |
|
void foo() { |
destruct_on_exit (Thread.MutexKey my_lock = my_mutex->lock()) { |
... do some work which requires the lock ... |
} |
} |
|
I.e. the destruct_on_exit clause ensures that the variable(s) in |
the parentheses are destructed (regardless of the amount of refs) |
if execution passes out of the block in any way. |
|
Anyway, since implementing the micro-gc is a comparatively small |
amount of extra work, the intention is to do that first, and then |
later implement the full gc as an experimental mode so that |
performance can be compared. |
|
2. This is not a problem as long as the reason only is gc efficiency. |
It's worth noting that tricks such as "wrapper objects" still have |
some use since they lessen the load on the background cycle |
detector. |
|
It is however a problem if there are open fd's or similar things |
in the structure. It doesn't look like this is feasible to solve |
internally; such structures typically are shared data, and letting |
different threads reference shared data without locking is |
essential for multi-cpu performance. This is therefore a case that |
is probably best to solve on the pike level instead, possibly |
through pike-visible refcounting. These cases appear to be fairly |
few, at least. |
|
3. If the solution in the issue "Micro-gc" is implemented, this |
problem hardly exists at all since thread local data is refcounted |
and freed almost exactly the same way as before. |
|
   Otherwise, since the gc thread operates only on the new and changed
data, and collects newly allocated data very efficiently, it would |
keep up with a very high mutation rate. GC runs are scheduled to |
run just often enough to keep the heap size within a set limit - |
as long as the gc thread doesn't become saturated and runs |
continuously, it offloads the refcounting and freeing overhead |
from the worker threads completely. |
|
If the data mutation rate is so high that the gc thread becomes |
saturated, what would happen is that malloc calls would start to |
block when the heap limit is reached. Research shows that a |
periodic gc done right provides considerably more throughput than |
pure refcounting, so the application would still run faster |
including that blocking. |
|
The remaining concern is then that the blocking would introduce |
uneven response times - the worker threads would go very fast most |
of the time but every once in a while they could hang waiting on |
the gc thread. These hangs are (according to the research paper) |
on the order of milliseconds, but if they still are problematic |
then a crude solution would be to introduce artificial short |
sleeps in the working threads to bring down the mutation rate - |
even with those sleeps the application would probably still be |
significantly faster than the current approach. |
|
|
Issue: Micro-gc |
|
A way to retain the immediate-destruct (and free) semantic for thread |
local things referenced only from the pike stack is to implement a |
"micro-gc" that runs very quickly and is called often enough to keep |
the semantic. |
|
To begin with, the mark-and-sweep gc for new data (as discussed in the |
issue "Garbage collector") is not implemented, and the refcounts for |
thread local things are not delay-updated at all. The work of the |
micro-gc then becomes to free all things in the zero-count table (ZCT) |
that aren't referenced from the thread's C and pike stacks. |
|
Scanning the two stacks completely in every micro-gc would be too |
expensive. That is solved by partitioning the ZCT so that every pike |
stack frame gets one of its own. New zero-count things are always put
in the ZCT for the current topmost frame.
|
That way, the micro-gc can scan the topmost parts of the stacks (above |
the last pike stack frame) for references to things in the topmost |
ZCT, and when a pike stack frame is popped then the things in its ZCT |
can be freed without scanning at all. This is enough to timely |
destruct and free the things put on the pike stack. |
|
Furthermore, since the old immediate-destruct semantics only requires |
destructing before and after every pike level function call, it won't |
be necessary for the micro-gc to scan the C stack at all. That is
because there never is any part of it above the current frame, i.e.
above the innermost mega_apply, to scan.
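
A rough C sketch of the per-frame ZCT handling described above (all
structs and helpers are hypothetical):

  struct zct {
    struct thing **entries;  /* things whose refcount reached zero */
    size_t num, alloc;
  };

  struct pike_frame {
    struct zct zct;          /* ZCT for this frame */
    /* ... the usual frame fields ... */
  };

  /* A refcount reached zero: remember the thing in the topmost
   * frame's ZCT instead of freeing it immediately. */
  void zct_add(struct pike_frame *top, struct thing *t);

  /* The micro-gc: scan only the stack regions above the topmost pike
   * stack frame; things in the topmost ZCT without refs from there
   * are destructed and freed. */
  void micro_gc(struct pike_frame *top);

  /* Frame pop: the stack above the frame is discarded, so whatever
   * remains in its ZCT can be freed without any scanning. */
  void micro_gc_pop_frame(struct pike_frame *f)
  {
    for (size_t i = 0; i < f->zct.num; i++)
      destruct_and_free(f->zct.entries[i]);  /* hypothetical */
    f->zct.num = 0;
  }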
|
Note that the above works under the assumption that new things are |
only referenced from the stacks in or below the current frame. That's |
not always true - code might change the stack further back to |
reference new things, e.g. if a function allocates some temporary |
struct on the stack and then passes the pointer to it to subroutines
that change it. |
|
Such code on the C level is very unlikely, since it would mean that C |
code would be changing something on the C stack back across a pike |
level apply. |
|
On the Pike level it can occur with inner functions changing variables |
in their surrounding functions. Those cases can however be handled by |
following the pointer chain (pike_frame.scope) to those function |
scopes, and that pointer chain is never deeper than the number of |
lexically nested functions. |
|
This micro-gc approach comes at a considerable expense compared to the |
solution described in the issue "Garbage collector": Not only does the |
generational gc with mark-and-sweep for young data disappear (which |
according to the research paper gives 15-40% more total throughput), |
but the delayed updating of the refcounts disappears to a large extent
too. Refcounting from the stacks is still avoided though, and delayed
updating of refcounts in shared data is still done, which is crucial |
for multi-cpu performance. |
|
|
Issue: Single-refcount optimizations |
|
Pre-multi-cpu Pike makes use of the refcounting to optimize |
operations: Some operations that shouldn't be destructive on their |
operands can be destructive anyway on an operand if it has no other |
references. A common case is adding elements to arrays:
|
array arr = ({}); |
while (...) |
arr += ({another_element}); |
|
Here arr only has a single reference from the stack, so the +=
operator destructively grows the array to add new elements to the end |
of it. |
|
With the new gc approach, such single-refcount optimizations no longer |
work in general. This is the case even if the micro-gc is implemented, |
since stack refs aren't counted. |
|
The primary case when these optimizations make a big difference is |
when data structures (mostly arrays and strings) are built with code |
like above. The characteristics for this case are: |
|
o The thing being built is thread local. |
o The thing being built only got references from the stack. |
o The thing being built might be passed to subfunctions, but they
  have returned when the would-be destructive operation takes place.
o The construct to optimize only occurs in Pike code (from C there |
are ways to explicitly request destructive updates). |
|
With the micro-gc, all refs from outside the two stacks are counted in |
real-time, so it's easy to detect if a thread local thing has
non-stack references. The remaining problem is therefore only when
there are several references to the same thing on the pike stack of |
the local function. That is uncommon, but it still requires attention; |
in the following case b has to be ({0}) and not ({0,1,2}) after the |
loop: |
|
array(int) a = ({0}), b = a; |
for (int i = 1; i <= 2; i++) |
a += ({i}); |
|
In simple cases, it appears easy to detect multiple stack references |
at compile time and disable the optimization. However, a simple check |
for a direct assignment from one stack variable to another is not |
foolproof. Consider: |
|
array(int) a = ({0}); |
array(array(int)) x = ({a}); |
array(int) b = x[0]; |
x = 0; |
for (int i = 1; i <= 2; i++) |
a += ({i}); |
|
It is no longer obvious to the compiler that b holds the same array as
a, and when 0 is assigned to x the non-stack ref disappears, so the
array in a and b has refcount zero when the += operation takes place.
|
The defensive way to cope is therefore to only allow destructive |
updates when (in addition to the conditions listed earlier) there is |
only one stack position in the current frame, or any surrounding frame |
reachable from the current function, that might contain a ref to the |
thing at the point where the destructive update is to take place. I.e. |
it must be clear that all other reachable stack positions cannot |
contain a ref. |
|
More or less complicated compile-time analysis can be used to check |
that, but it's not far fetched to believe that there can be situations |
in current code that the compiler can't analyze well enough. |
|
Also, the analysis above assumes the micro-gc, which is a less than |
optimal solution in itself and probably something to be ditched |
eventually. In its absence this problem becomes a lot more difficult.
|
The approach above can also be applied to mappings, multisets and |
objects (sporting `+=) being built the same way. Strings are also |
common in this use case, but they require a different solution since |
they are always shared, i.e. not thread local. |
|
For strings, the compiler could detect string variables being modified |
through +=, and in such cases emit code that treats them as string |
builders (a string builder is an unfinished string that can be |
modified, it has not been hashed into the global string table, and it |
cannot be used in comparisons etc). Then += can be implemented with a |
string_builder_append, and every time the string builder is being used |
in a string context, the string builder content gets converted to a |
real string. The string builder itself is bound to the specific |
variable on the stack, and it cannot get other references. |
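
Pike's C level already has a string builder API in stralloc.h (struct
string_builder, init_string_builder, finish_string_builder etc.), so
what the compiler emits for such a loop could look roughly like this
(the lowering itself is only a sketch):

  #include "stralloc.h"  /* Pike core header with string_builder */

  /* How  string s = "";  followed by repeated  s += part  could be
   * lowered; names and calling convention here are illustrative. */
  struct pike_string *build_string(struct pike_string **parts, size_t n)
  {
    struct string_builder sb;
    init_string_builder(&sb, 0);       /* unfinished, unhashed string */
    for (size_t i = 0; i < n; i++)
      string_builder_shared_strcat(&sb, parts[i]);  /* the += steps */
    /* First use in a string context: hash into the global string
     * table and return an ordinary shared string. */
    return finish_string_builder(&sb);
  }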
|
The string approach cannot be used for other data types since they can |
be modified destructively. Consider: |
|
array(int) a = ({0}), b = a; |
b[0] = 1; |
for (int i = 1; i <= 2; i++) |
a += ({i}); |
|
If a was an "array builder" here then the assignment b = a would |
implicitly copy the array, but the assignment to b[0] should affect a |
too, because at that point both a and b refer to the same array. |
|
To conclude, the current single-refcount optimizations in these common |
cases can be solved in other ways, but not completely, and it would |
require quite a bit of work in the compiler. |
|
Another approach is to introduce language constructs, like |
String.Buffer, to do destructive updates explicitly. That would also |
allow destructive updates even when there are intentional multiple |
refs (the lack of such tools is a drawback in the current |
implementation). The problem is that old code needs some rewriting to |
keep its performance. |
|
FIXME: Are there other important single-refcount optimization cases? |
|
|
Issue: Weak ref garbage collection |
|
Each thing has two refcounters - one for total number of refs and |
another for the number of weak refs. The thing is semantically freed |
when they become equal. The problem is that it still has refs which
might be followed later, so the gc still cannot free it.
|
There are two ways to tackle this problem: |
|
1. Keep track of all the weak pointers that point to each thing, so |
that they can be followed backwards and cleared when only weak |
pointers are left. |
|
That tracking requires additional data structures and the associated |
overhead. Clearing these pointers would be done by the gc thread,
which presents the problem of how to do that for things which are
write or read-constant locked. There are several alternatives:
|
a. The gc thread clears all pointers for unlocked things (locking |
them while doing so), and for locked things leaves a work list of |
clearings to do by the locking thread when it is about to release |
the lock. |
|
Care must be taken to handle the case that the thing containing |
the pointer becomes unreferenced and freed by the gc before the |
lock is released. |
|
b. Make asynchronous clearing part of the semantics for weak |
pointers, i.e. no lock can stop a weak pointer from being cleared. |
Reading and clearing such pointers must then be atomic. (A thread |
can always ensure specific pointers aren't cleared in inconvenient |
situations by having an extra nonweak reference to the thing, e.g. |
on the stack.) |
|
As long as the gc thread clears the pointers, there is no risk that
the things containing them get freed, since only the gc thread might
do that. There is however a risk that the pointers have changed; CAS |
is necessary, and the structure tracking the reverse dependencies |
should be lock-free. |
|
2. Free all refs emanating from the thing with only weak pointers |
left, and keep it as an empty structure (a destructed object, an empty |
array/multiset/mapping, or an empty skeleton program which contains no |
identifiers). |
|
This approach requires a flag to recognize such semi-freed things, and |
that all code that dereferences weak pointers checks for it. A problem
is that data blocks remain allocated longer than necessary, maybe even |
indefinitely. That can be mitigated to some degree by shortening them |
using realloc(3). |
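
Issue "Garbage collector" item (f) suggests keeping both counts (and
flags) in a single refcounter word. A minimal sketch of such a
packing, with a hypothetical layout and the flag and overflow
("stuck" counter) handling omitted:

  #include <stdatomic.h>
  #include <stdint.h>

  /* Low 16 bits: total refcount. Next 16 bits: weak refcount. */
  #define WEAK_SHIFT 16
  #define COUNT_MASK 0xffffu

  typedef atomic_uint_fast32_t refword_t;

  static inline void add_ref(refword_t *r)
  { atomic_fetch_add(r, 1); }

  static inline void add_weak_ref(refword_t *r)  /* counts in both */
  { atomic_fetch_add(r, (1u << WEAK_SHIFT) + 1); }

  /* The thing is semantically freed when the two counts are equal,
   * i.e. when only weak refs remain. */
  static inline int only_weak_refs_left(uint_fast32_t v)
  { return (v & COUNT_MASK) == (v >> WEAK_SHIFT); }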
|
|
Issue: Moving things between lock spaces |
|
Things can be moved between lock spaces, or be made thread local or |
disowned. In all these cases, one or more things are given explicitly. |
It's natural if not only those things are moved, but also all other |
things in the same source lock space that are referenced from the |
given things and not from anywhere else (this is the same traversal
that Pike.count_memory does). In the case of making things thread
local or disowned, it is also necessary to check that the explicitly
given things aren't referenced from elsewhere.
|
FIXME: This is a problem with the proposed garbage collector (see |
issue "Garbage collector"). Old things got refcounts that can be used, |
but they might be stale, and the logging doesn't provide information |
in the form we need. New things are even worse since they got no |
refcounts at all that can be used to check for outside refs. |
Furthermore, there is a race since an external ref can be added at any |
time from any thread. |
|
All this is settled when the gc is run: If the "controlled" refs are |
temporarily ignored then the set to move is the one that would turn |
into garbage. But it is not good to have to either wait for the gc or
run it synchronously.
|
Also, the problem above applies to Pike.count_memory too. |
|
|
Issue: Strings |
|
Strings are unique in Pike. This property is hard to keep if threads |
have local string pools, since a thread local string might become |
shared at any moment, and thus would need to be moved. Therefore the |
string hash table remains global, and lock congestion is avoided with |
some concurrent access hash table implementation. See issue "Lock-free |
hash table". |
|
Lock-free is a good start, but the hash function must also provide a |
good even distribution to avoid hotspots. Pike currently uses an |
in-house algorithm (DO_HASHMEM in pike_memory.h). Replacing it with a |
more widespread and better studied alternative should be considered. |
There seem to be few that are below O(n) (which DO_HASHMEM is),
though.
|
|
Issue: Types |
|
Like strings, types are globally unique and always shared in Pike. |
That means lock-free access to them is desirable, and it should also |
be doable fairly easily since they are constant. Otoh it's probably |
not as vital as for strings since types typically only are built |
during compilation. |
|
|
Issue: Mapping and multiset data blocks |
|
Mappings and multisets currently have a deferred copy-on-write |
behavior, i.e. several mappings/multisets can share the same data |
block and it's only copied to a local one when changed through a |
specific mapping/multiset. |
|
If mappings and/or multisets are changed to be lock-free then the |
copy-on-write behavior needs to be solved: |
|
o A flag is added to the mapping/multiset data block that is set |
whenever it is shared. |
o Every destructive operation checks the flag. If set, it makes a
  copy, otherwise it changes the original block. Thus the flag is
  essentially a read-only marker (see the sketch after this list).
o In addition to the flag, the gc performs normal refcounting. It |
clears the flag if the refcount is 1. (The refcount cannot be used |
directly since it's delay-updated.) |
o Hazard pointers are necessary for every destructive access, |
including the setting of the flag. The reason is that the |
read-onlyness only is in effect after all currently modifying |
threads are finished with the block. The thread that is setting the |
flag therefore has to wait until there are no other hazard pointers |
to the block before returning. |
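
A sketch of how a destructive mapping operation might apply these
rules (all names are hypothetical, and the wait for other modifying
threads is only hinted at):

  /* Called at the start of every destructive mapping operation. */
  struct mapping_data *prepare_for_change(struct mapping *m)
  {
    struct mapping_data *md = m->data;
    hazard_set(md);               /* pin md against concurrent free */
    if (md->shared_flag) {        /* read-only marker set on sharing */
      struct mapping_data *copy = copy_mapping_data(md);
      m->data = copy;             /* further changes go to the copy */
      hazard_clear(md);
      hazard_set(copy);
      return copy;
    }
    return md;  /* unshared: change in place; the caller clears the
                 * hazard pointer when the change is done */
  }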
|
It's a good question whether keeping the copy-on-write feature is |
worth this overhead. Of course, an alternative is to simply let the |
builtin mappings and/or multisets be locking, and instead have special |
objects that implements lock-free data types. |
|
Another issue is if things like mapping/multiset data blocks should be |
first or second class things (c.f. issue "Memory object structure"). |
If they're second class it means copy-on-write behavior doesn't work |
across lock spaces. If they're first class it means additional |
overhead handling the lock spaces of the mapping data blocks, and if a |
mapping data is shared between lock spaces then it has to be in some |
third lock space of its own, or in the global lock space, neither of |
which would be very good. So it doesn't look like there's a better way |
than to botch copy-on-write in this case. |
|
|
Issue: Emulating the interpreter lock |
|
For compatibility with old C modules, and for the _disable_threads |
function, it is necessary to retain a complete lock like the current |
interpreter lock. It has to lock the global area for writing, and
also stop all access to all lock spaces, since the thread local data |
might refer to any lock space. |
|
This lock is implemented as a read/write lock, which normally is held |
permanently for reading by all threads. Only when some thread is
waiting to acquire the compat interpreter lock do the other threads
release it, as each of them goes into check_threads().
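
A sketch with POSIX primitives (a real implementation would use
Pike's portability layer, and the names here are illustrative):

  #include <pthread.h>
  #include <stdatomic.h>

  static pthread_rwlock_t compat_lock = PTHREAD_RWLOCK_INITIALIZER;
  static atomic_bool compat_lock_wanted;

  /* Every thread normally holds compat_lock for reading and briefly
   * lets a waiting writer in from check_threads(). */
  void check_threads_compat_hook(void)
  {
    if (atomic_load(&compat_lock_wanted)) {
      pthread_rwlock_unlock(&compat_lock);  /* let the writer in */
      pthread_rwlock_rdlock(&compat_lock);  /* then continue as usual */
    }
  }

  /* The caller must have released its own read lock first. */
  void grab_compat_interpreter_lock(void)
  {
    atomic_store(&compat_lock_wanted, 1);
    pthread_rwlock_wrlock(&compat_lock);    /* waits for all readers */
    atomic_store(&compat_lock_wanted, 0);
  }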
|
This lock cannot wait for explicit lock space locks to be released. |
Thus it can override the assumption that a lock space is safe from |
tampering by holding a write lock on it. Still, it's only available |
from the C level (with the exception of _disable_threads) so the |
situation is not any different from the way the interpreter lock |
overrides Thread.Mutex today. |
|
|
Issue: Function calls |
|
A lock on an object is almost always necessary before calling a |
function in it. Therefore the central apply function (mega_apply) must |
ensure an appropriate lock is taken. Which kind of lock |
(read-safe/read-constant/write - see issue "Lock space lock |
semantics") depends on what the function wants to do. Therefore all |
object functions are extended with flags for this. |
|
The best default is probably read-safe. Flags for no locking (for the |
few special cases where the implementations actually are completely |
lock-free) and for compat-interpreter-lock-locking should probably |
exist as well. A compat-interpreter-lock flag is also necessary for |
global functions that don't have a "this" object (aka efuns). |
|
Having the required locking declared this way also relieves each
function of the burden of doing the locking to access the current
storage, and it allows future compiler optimizations to minimize lock |
operations. |
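
The flags could look something like this (a sketch; both the flag
names and the registration function are hypothetical):

  /* Locking required by a function, declared when it is added. */
  #define FUNC_LOCK_READ_SAFE     0x01  /* the suggested default */
  #define FUNC_LOCK_READ_CONSTANT 0x02
  #define FUNC_LOCK_WRITE         0x04
  #define FUNC_LOCK_NONE          0x08  /* completely lock-free impl */
  #define FUNC_LOCK_COMPAT_INTERP 0x10  /* needs the compat
                                         * interpreter lock */

  /* mega_apply consults the flags and takes the corresponding lock
   * on the current object before the call, e.g.: */
  add_function_locked("lookup", f_lookup, tFunc(tStr, tMix),
                      0, FUNC_LOCK_READ_SAFE);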
|
|
Issue: Exceptions |
|
"Forgotten" locks after exceptions shouldn't be a problem: Explicit |
locks are handled just like today (i.e. it's up to the pike |
programmer), and implicit locks can safely be released when an |
exception is thrown. |
|
One case requires attention: An old-style function that requires the |
compat interpreter lock might catch an error. In that case the error |
system has to ensure that lock is reacquired. |
|
|
Issue: C module interface |
|
A new add_function variant will probably be added for new-style
functions.
It takes bits for the flags discussed for issue "Function calls". |
New-style functions can only assume free access to the current storage |
according to those flags; everything else must be locked (through a |
new set of macros/functions). |
|
Accessor functions for data types (e.g. add_shared_strings,
mapping_lookup, and object_index_no_free) handle the necessary
locking internally. They will only assume that the thing is safe, i.e. |
that the caller ensures the current thread controls at least one ref. |
|
THREADS_ALLOW/THREADS_DISALLOW and their likes are not used in |
new-style functions. |
|
There will be new GC callbacks for walking module global pointers to |
things (see issue "Garbage collection and external references"). |
|
The proposed gc requires that every pointer change in a (heap |
allocated) thing is tracked (for pointers that might point to other |
heap allocated things). This is because the gc has to log the old |
state of the pointers before the first change after a gc run (see |
issue "Garbage collector", item c). For all builtin data types, this |
is handled internally in primitives like mapping_insert and |
object_set_index, so the only cases that the C module code typically |
has to handle are direct updates in the current storage. Therefore all
pointer changes that currently look something like
|
THIS->my_thing = some_thing; |
|
must be wrapped in some kind of macro/function call to become: |
|
set_ptr (THIS, my_thing, some_thing); |
|
On the positive side, all the refcount twiddling to account for |
references from the C and pike stacks can be removed from the C code. |
That also includes a lot of the SET_ONERROR stuff which currently is |
necessary to avoid lost refs when errors are thrown. |
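
What set_ptr expands to is not specified here, but the gc described
in issue "Garbage collector" suggests roughly the following (a sketch
with hypothetical names and thing layout):

  #define set_ptr(thing, field, val) do {                           \
      /* Log the old pointer state on the first change after a gc   \
       * run (see issue "Garbage collector", item c). */            \
      if (!(thing)->hdr.log_pointer)                                \
        gc_log_pointer_state(&(thing)->hdr); /* hypothetical */     \
      (thing)->field = (val);                                       \
    } while (0)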
|
|
Issue: C module compatibility |
|
Currently it doesn't look like the goal of keeping a source-level
compatibility mode for C modules can be achieved. The problem is that
every pointer assignment in every heap allocated thing must be wrapped
inside a macro/function call to make the new gc work (see issue
"Garbage collector", item c), and lots of C module code changes such
pointers directly through plain assignments.
|
Ref issue "Emulating the interpreter lock". |
|
Ref issue "Garbage collection and external references". |
|
|
Issue: Garbage collection and external references |
|
The current gc design is that there is an initial "check" pass that |
determines external references by counting all internal references
and then, for each thing, subtracting that count from its refcount.
If the result isn't zero then there are external references (e.g.
from global C variables or from the C stack) and the thing is not
garbage.
|
The new gc (c.f. issue "Garbage collector") does not refcount external |
refs and refs from the C or Pike stacks. It needs to find them some |
other way: |
|
References from global C variables are few, so they can be dealt with |
by requiring C modules and the core parts to provide callbacks that |
let the gc walk through them (see issue "C module interface"). This
is however not compatible with old C modules. |
|
References from C stacks are common, and it is infeasible to require |
callbacks that keep track of them. The gc instead has to scan the C |
stacks for the threads and treat any aligned machine word containing |
an apparently valid pointer to a gc candidate thing as an external |
reference. This is the common approach used by standalone gc libraries |
that don't require application support. For reference, here is one |
such garbage collector, written in C++: |
http://developer.apple.com/DOCUMENTATION/Cocoa/Conceptual/GarbageCollection/Introduction.html#//apple_ref/doc/uid/TP40002427 |
Its source is here: |
http://www.opensource.apple.com/darwinsource/10.5.5/autozone-77.1/ |
|
The same approach would also be necessary to cope with old C modules |
(see issue "C module compatibility"), but since global C level |
pointers are few, it might not be mandatory to get this working. And |
besides, it appears unlikely that compatibility with old C modules can |
be kept. |
|
|
Issue: Global pike level caches |
|
Global caches that are shared between threads are common, and in |
almost all cases such caches are implemented using mappings. There's |
therefore a need for (at least) a hash table data type that handles
concurrent access and high mutation rates very efficiently. |
|
Issue "Lock-free hash table" discusses such a solution. It's currently |
not clear whether the builtin mappings will be lock-free or not (c.f. |
the copy-on-write problem in issue "Mapping and multiset data |
blocks"), but if they're not then a mapping-like object class is |
implemented that is lock-free. It's easy to replace global cache |
mappings with such objects. |
|
|
Issue: Thread.Queue |
|
A lock-free implementation should be used. The things in the queue are |
typically disowned to allow them to become thread local in the reading |
thread. |
|
|
Issue: "Relying on the interpreter lock" |
|
FIXME |
|
|
Issue: False sharing |
|
False sharing occurs when thread local things used frequently by |
different threads are next to each other so that they share the same |
cache line. Thus the cpu caches might force frequent resynchronization |
of the cache line even though there is no apparent hotspot problem on |
the C level. |
|
This can be a problem in particular for all the block_alloc pools |
containing small structs. Using thread local pools is seldom a |
workable solution since most thread local structs might become shared |
later on. |
|
One way to avoid it is to add padding (and alignment). Cache line |
sizes are usually 64 bytes or less (at least for Intel ia32). That |
should be small enough to make this viable in many cases. |
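
For example, with C11 alignment support a frequently used small
struct can be padded out to a cache line (64 bytes assumed here):

  #include <stdalign.h>

  #define CACHE_LINE 64

  /* alignas rounds the struct size up to a multiple of CACHE_LINE,
   * so consecutive elements in a pool never share a cache line. */
  struct padded_counter {
    alignas(CACHE_LINE) long count;
  };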
|
FIXME: Check cache line sizes on the other important architectures. |
|
Another way is to move things when they get shared, but that is pretty |
complicated and slow. |
|
|
Issue: Malloc and block_alloc |
|
Standard OS mallocs are usually locking. Bundling a lock-free one |
could be important. FIXME: Survey free implementations. |
|
Block_alloc is a simple homebrew memory manager used in several |
different places to allocate fixed-size blocks. The block_alloc pools |
are often shared, so they must allow efficient concurrent access. With |
a modern malloc, it is possible that the need for block_alloc is gone, |
or perhaps the malloc lib has builtin support for fixed-size pools. |
Making a lock-free implementation is nontrivial, so the homebrew ought |
to be ditched in any case. |
|
A problem with ditching block_alloc is that there is some code that |
walks through all allocated blocks in a pool, and also avoids garbage |
by freeing the whole pool altogether. FIXME: Investigate alternatives |
here. |
|
See also issue "False sharing". |
|
|
Issue: Heap size control |
|
There should be better tools to control the heap size. It should be |
possible to set the wanted heap size so that the gc runs timely before |
that limit is reached. Pike should detect the available amount of real |
memory (i.e. not counting swap) to use as default. The gc should still |
use a garbage projection strategy to keep the process below the |
configured maximum size for as long as possible. This is more |
important if the gc is used also for previously refcounted garbage |
(c.f. issue "Garbage collector"). |
|
Malloc calls should be wrapped to allow the gc to run in blocking mode |
in case they fail. |
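
A minimal sketch of such a wrapper, where gc_run_blocking is a
hypothetical entry point that runs the gc in blocking mode:

  #include <stdlib.h>

  void gc_run_blocking (void);  /* Hypothetical blocking gc run. */

  void *gc_aware_malloc (size_t size)
  {
    void *p = malloc (size);
    if (!p) {
      gc_run_blocking ();       /* Try to reclaim memory... */
      p = malloc (size);        /* ...and retry once. NULL here means
                                 * the process really is out of memory. */
    }
    return p;
  }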
|
|
Issue: The compiler |
|
FIXME |
|
|
Issue: Foreign thread visits |
|
FIXME. JVM threads.. |
|
|
Issue: Pike security system |
|
It is possible that keeping the pike security system intact would
complicate the implementation, and even if it were kept intact, a lot
of testing would be required before one could be confident that it
really works (and there are currently very few tests for it in the
test suite).

Also, the security system isn't used at all to my (mast's) knowledge,
and it is not even compiled in by default (it has to be enabled with a
configure flag).
|
All this leads to the conclusion that it is easiest to ignore the |
security system altogether, and if possible leave it as it is with the |
option to get it working later. |
|
|
Issue: Contention-free counters |
|
There is probably a need for contention-free counters in several
different areas. It should be possible to update them from several
threads in parallel without synchronization. Querying the current
count is always approximate since it can be changing simultaneously
in other threads. However, the thread's own local count is always
accurate.
|
They should be separated from the blocks they apply to, to avoid cache |
line invalidation of those blocks. |
|
To accomplish that, a generic tool somewhat similar to block_alloc is
created that allocates one or more counter blocks for each thread.
Indexes are allocated within these blocks, so a counter is defined by
the same index into every thread's local counter blocks.
|
Each thread can then modify its own counters without locking, and it |
typically has its own counter blocks in the local cache while the |
corresponding main memory is marked invalid. To query a counter, a |
thread would need to read the blocks for all other threads. |
|
This means that these counters are efficient for updates but less so
for queries. However, since queries are always approximate, it is
possible to cache them for some time (e.g. 1 ms). Each thread would
need its own cache though, since the local count cannot be cached.
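
A minimal sketch of the layout, with fixed-size arrays and gcc
alignment syntax purely for illustration (the real tool would
allocate counter blocks dynamically per thread):

  #define MAX_THREADS 64
  #define MAX_COUNTERS 256

  /* One row of counters per thread, aligned so that rows belonging
   * to different threads never share a cache line. */
  struct counter_row {
    long counts[MAX_COUNTERS];
  } __attribute__ ((aligned (64)));

  static struct counter_row rows[MAX_THREADS];

  /* Updates only touch the calling thread's own row - no locking. */
  static void counter_add (int my_thread_no, int counter, long n)
  {
    rows[my_thread_no].counts[counter] += n;
  }

  /* Queries sum over all rows and are therefore approximate and
   * comparatively expensive - hence the caching suggested above. */
  static long counter_query (int counter)
  {
    long sum = 0;
    int t;
    for (t = 0; t < MAX_THREADS; t++)
      sum += rows[t].counts[counter];
    return sum;
  }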
|
Allocating and freeing counters should be lock-free, and preferably
also starting and stopping threads (c.f. issue "Foreign thread
visits"). In both cases the freeing step represents a race problem -
see issue "Hazard pointers". To free counters, the counter index would
constitute the hazard pointer.
|
|
Issue: Lock-free hash table |
|
A good lock-free hash table implementation is necessary. A promising
one is http://blogs.azulsystems.com/cliff/2007/03/a_nonblocking_h.html.
It requires a CAS (Compare And Swap) instruction to work, but that
shouldn't be a problem. The Java implementation
(http://sourceforge.net/projects/high-scale-lib) is in the public
domain. In the comments there is talk of efforts to make a C version.
|
It supports (through putIfAbsent) the uniqueness requirement for |
strings, i.e. if several threads try to add the same string (at |
different addresses) then all will end up with the same string pointer |
afterwards. |
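
Sketched in C as string interning, where lf_table_put_if_absent and
string_table are hypothetical names for the C counterpart of
putIfAbsent (returning the previously stored value, or NULL if the
insert won), and pike_string/free_string are assumed to be the usual
core string primitives:

  struct pike_string *intern (struct pike_string *s)
  {
    struct pike_string *old =
      lf_table_put_if_absent (string_table, s->str, s->len, s);
    if (!old)
      return s;       /* We won - our pointer is now the shared one. */
    free_string (s);  /* Another thread added an equal string first; */
    return old;       /* drop ours and use the shared pointer. */
  }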
|
The Java implementation relies on the gc to free up the old hash
tables after resize. The proposed gc (issue "Garbage collector") would
solve that for us too, but even without it the problem is still
solvable - see issue "Hazard pointers".
|
|
Issue: Hazard pointers |
|
A problem with most lock-free algorithms is how to know that no other
thread is accessing a block that is about to be freed. Another is the
ABA problem, which can occur when a block is freed and immediately
allocated again (common with block_alloc).
|
Hazard pointers are a good way to solve these problems without leaving |
the blocks to the garbage collector (see |
http://www.research.ibm.com/people/m/michael/ieeetpds-2004.pdf). So a |
generic hazard pointer tool might be necessary for blocks not known to |
the gc. |
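
The core of the protocol is a publish-and-recheck step; a minimal
sketch, where hazard_slot points into the calling thread's hazard
pointer record and MFENCE is as in issue "Platform specific
primitives":

  /* Announce the pointer we intend to dereference, then verify that
   * it is still reachable. Once the loop exits, no other thread may
   * free *n until the hazard slot is cleared again. */
  struct node *hazard_acquire (struct node *volatile *head,
                               void *volatile *hazard_slot)
  {
    struct node *n;
    do {
      n = *head;
      *hazard_slot = n;
      MFENCE ();          /* Make the announcement globally visible. */
    } while (n != *head); /* The head changed under us - retry. */
    return n;
  }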
|
Note however that a more difficult variant of the ABA problem can
still occur when the block cannot be freed after leaving the data
structure. (In the canonical example with a lock-free stack - see e.g.
"ABA problem" in Wikipedia - consider the case when A is a thing that
continues to live on and actually gets pushed back.) The only reliable
way to cope with that is probably to use wrappers.
|
|
Issue: Thread local storage |
|
Implementation would be considerably simpler if working TLS can be
assumed on the C level, through the __thread keyword (or
__declspec(thread) in Visual C++; a usage sketch follows the list
below). A survey of the support for TLS in common compilers and OS'es
is needed to decide whether this is a workable assumption:
|
o GCC: __thread is supported. Source: Wikipedia. |
FIXME: Check from which version. |
|
o Visual C++: __declspec(thread) is supported. Source: Wikipedia. |
FIXME: Check from which version. |
|
o Intel C compiler: Support exists. Source: Wikipedia. |
FIXME: Check from which version. |
|
o Sun C compiler: Support exists. Source: Wikipedia. |
FIXME: Check from which version. |
|
o Linux (i386, x86_64, sparc32, sparc64): TLS is supported and works |
for dynamic libs. C.f. http://people.redhat.com/drepper/tls.pdf. |
FIXME: Check from which version of glibc and kernel (if relevant). |
|
o Windows (i386, x86_64): TLS is supported but does not always work
  in DLLs loaded using LoadLibrary (which means all dynamic modules
  in pike). C.f. http://msdn.microsoft.com/en-us/library/2s9wt68x.aspx.
  According to Wikipedia this is fixed in Vista and Server 2008
  (FIXME: verify). In any case, TLS is still usable in the pike core.
|
o MacOS X: FIXME: Check this. |
|
o Solaris: FIXME: Check this. |
|
o *BSD: FIXME: Check this. |
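
If the survey turns out positive, usage would be about as simple as
this sketch (the variable is just an example):

  /* The current thread's interpreter context as compiler-level TLS. */
  #ifdef _MSC_VER
  __declspec(thread) struct thread_state *current_thread_state;
  #else
  __thread struct thread_state *current_thread_state;
  #endif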
|
|
Issue: Platform specific primitives |
|
Some low-level primitives, such as CAS and fences, are necessary to |
build the various lock-free tools. A third-party library would be |
useful. |
|
o An effort to make a standardized library is here:
  http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2047.html
  (C level interface at the end). It apparently lacks an
  implementation, though.
|
o The linux kernel is reported to contain a good abstraction lib for |
these primitives, along with implementations for a large set of |
architectures (see |
http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2006.08.21a.pdf). |
|
o Another one is part of a lock-free hash implementation here:
  http://www.sunrisetel.net/software/devtools/sunrise-data-dictionary.shtml
  It has an MIT-style open source license (with ad clauses).
|
It appears that the libraries themselves are very short and simple; |
the difficult part is rather to specify the semantics carefully. It's |
probably easiest to make one ourselves with ideas from e.g. the linux |
kernel paper mentioned above. |
|
Required operations: |
|
CAS(address, old_value, new_value) |
Compare-and-set: Atomically sets *address to new_value iff its |
current value is old_value. Needed for 32-bit variables, and on |
64-bit systems also for 64-bit variables. |
|
ATOMIC_INC(address)
ATOMIC_DEC(address)
  Increments/decrements *address atomically. Can be simulated with
  CAS (see the sketch after this list). 32-bit version necessary,
  64-bit version would be nice.
|
LFENCE() |
Load fence: All memory reads in the thread before this point are |
guaranteed to be done (i.e. be globally visible) before any |
following it. |
|
SFENCE() |
Store fence: All memory writes in the thread before this point are |
guaranteed to be done before any following it. |
|
MFENCE() |
Memory fence: Both load and store fence at the same time. (On many |
architectures this is implied by CAS etc, but we shouldn't assume |
that.) |
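
As an example, the simulation mentioned for ATOMIC_INC could look
like this, assuming CAS returns nonzero iff the swap took place
(INT32 stands for any 32-bit integer type):

  static void atomic_inc32 (volatile INT32 *address)
  {
    INT32 old;
    do {
      old = *address;   /* Read the current value... */
    } while (!CAS (address, old, old + 1));
                        /* ...and retry if it changed before the swap. */
  }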
|
The following operations are uncertain - still not known if they're |
useful and supported enough to be required, or if it's better to do |
without them: |
|
CASW(address, old_value_low, old_value_high, new_value_low, new_value_high)
  A compare-and-set that works on an area of twice the pointer size.
  Supported on more modern x86 and x86_64 processors (c.f.
  http://en.wikipedia.org/wiki/Compare-and-swap#Extensions).
|
FIXME: More.. |
|
Survey of platform support: |
|
o Windows/Visual Studio: Got "Interlocked Variable Access": |
http://msdn.microsoft.com/en-us/library/ms684122.aspx |
|
o FIXME: More.. |
|
|
Issue: Preemptive thread suspension |
|
The proposed gc as presented in the research paper needs to suspend |
and resume other threads. A survey of platform support for preemptive |
thread suspension: |
|
o POSIX threads: No support. Deprecated and removed from the standard |
since it can very easily lead to deadlocks. On some systems there |
might still be a pthread_suspend function. |
|
o Windows: SuspendThread and ResumeThread exist, but they are only
  intended for use by debuggers.
|
It's clear that a nonpreemptive method is required. See issue "Garbage |
collector" item g for details on that. |
|
|
Issue: OpenMP |
|
OpenMP (see www.openmp.org) is a system to parallelize code using
pragmas that are inserted into the code blocks. It can be used to
easily parallelize otherwise serial internal algorithms, like
searching and all sorts of loops over arrays. Thus it addresses a
different problem than the high-level parallelizing architecture
above, but it might provide significant improvements nevertheless.

It is therefore worthwhile to look into how this can be deployed in
the Pike sources. If support is widespread enough, OpenMP could even
be made a requirement, so that its builtin tools for atomicity and
ordering can be used (provided they are useful outside the omp
parallelized blocks).
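
A typical candidate is an internal loop whose iterations are
independent; for example (dst, src, transform and size are just
placeholders):

  int i;
  /* The pragma is ignored by compilers without OpenMP support, so
   * the code degrades gracefully to a serial loop. */
  #pragma omp parallel for
  for (i = 0; i < size; i++)
    dst[i] = transform (src[i]);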
|
Compiler support (taken from www.openmp.org): |
|
o gcc since 4.3.2. |
o Microsoft Visual Studio 2008 or later. |
o Sun compiler (starting version unknown). |
o Intel compiler since 10.1. |
o ..and some more. |
|
FIXME: Survey platform-specific limitations. |
|
|
Various links |
|
Pragmatic nonblocking synchronization for real-time systems |
http://www.usenix.org/publications/library/proceedings/usenix01/full_papers/hohmuth/hohmuth_html/index.html |
DCAS is not a silver bullet for nonblocking algorithm design |
http://portal.acm.org/citation.cfm?id=1007945 |
A simple and efficient memory model for weakly-ordered architectures |
http://www.open-std.org/Jtc1/sc22/WG21/docs/papers/2007/n2237.pdf |
|