Multi-cpu support in Pike
-------------------------

This is a draft spec for how to implement multi-cpu support in Pike.
The intention is that it gets extended along the way as more issues
get ironed out. Discussions take place in "Pike dev" in LysKOM or
pike-devel@lists.lysator.liu.se.

Initial draft created 8 Nov 2008 by Martin Stjernholm.


Background and goals

Pike supports multiple threads, but like many other high-level
languages it only allows one thread at a time to access the data
structures. This means that the utilization of multi-cpu and
multi-core systems remains low, even though there are some modules
that can do isolated computational tasks in parallel (e.g. the Image
module).

It is the so-called "interpreter lock" that must be locked to access
any reference variable (i.e. everything except floats and native
integers). This lock is held by default in essentially all C code and
is explicitly unlocked in a region by the THREADS_ALLOW/
THREADS_DISALLOW macros. On the pike level, the lock is always held -
no pike variable can be accessed and no pike function can be called
otherwise.

The purpose of the multi-cpu support is to rectify this. The design
goals are, in order of importance:

1. Pike threads should be able to execute pike code concurrently on
   multiple cpus as long as they only modify thread local pike data
   and read a shared pool of static data (i.e. the pike programs,
   modules and constants).

2. There should be as few internal hot spots as possible (preferably
   none) when pike code is executed concurrently. Care must be taken
   to avoid internal synchronization, or updates of shared data that
   would cause "cache line ping-pong" between cpus.

3. The concurrency should be transparent on the pike level. Pike code
   should still be able to access shared data without locking and
   without risking low-level inconsistencies. (So Thread.Mutex etc
   would still be necessary to achieve higher level synchronization.)

4. There should be tools on the pike level to allow further
   performance tuning, e.g. lock-free queues, concurrent access hash
   tables, and the possibility to lock different regions of shared
   data separately. These tools should be designed so that they are
   easy to slot into existing code with few changes.

5. There should be tools to monitor and debug concurrency. It should
   be possible to make assertions that certain objects aren't shared,
   and that certain access patterns don't cause thread
   synchronization. This is especially important if goal (3) is
   realized, since the pike code by itself won't show what is shared
   and what is thread local.

6. C modules should continue to work without source level
   modification (but likely without allowing any kind of
   concurrency).

Note that even if goal (3) is accomplished, this is no miracle cure
that would make all multithreaded pike programs run with optimal
efficiency on multiple cpus. One could expect better concurrency in
old code without adaptations, but it could still be hampered
considerably by e.g. frequent updates to shared data. Concurrency is
a problem that must be taken into account on all levels.


Other languages

Perl: All data is thread local by default. Data can be explicitly
shared, in which case Perl ensures internal consistency. Every shared
variable is apparently locked individually. Referencing a thread
local variable from a shared one causes the thread to die. See
perlthrtut(1).

Python: Afaik it's the same state of affairs as Pike.


Solution overview

The basic approach is to divide all data into thread local and
shared:

o Thread local data is everything that is accessible to one thread
  only, i.e. there are no references to anything in it from shared
  data or from any other thread. This is typically data that the
  current thread has created itself and only references from the
  stack. The thread can access its local data without locking.

o Shared data is everything that is accessible from more than one
  thread.
  Access to it is synchronized using a global read/write lock, the
  so-called "global lock". I.e. this lock can either be locked for
  reading by many threads, or be locked by a single thread for
  writing. Locking the global lock for writing is the same as
  locking the interpreter lock in current pikes. (This single lock
  is refined later - see issue "Lock spaces".)

o There is also a special case where data can be "disowned", i.e.
  not shared and not local in any thread. This is used in e.g.
  Thread.Queue for the objects that are in transit between threads.
  Disowned data cannot have arbitrary references to it - it must
  always be under the control of some object that in some way
  ensures consistency. (Garbage could be made disowned since it by
  definition no longer is accessible from anywhere, but of course it
  is always better to clean it up instead.)

  +--------+           +---------------------+    Direct     +--------+
  |        |<-- refs --| Thread 1 local data |<- - access - -|        |
  |        |           +---------------------+               | Thread |
  |        |                                                 |   1    |
  |        |<- - - - Access through global lock only - - - - |        |
  | Shared |                                                 +--------+
  |  data  |           +---------------------+    Direct     +--------+
  |        |<-- refs --| Thread 2 local data |<- - access - -|        |
  |        |           +---------------------+               | Thread |
  |        |                                                 |   2    |
  |        |<- - - - Access through global lock only - - - - |        |
  |        |                                                 +--------+
  +--------+
                               ... etc ...

The principal use case for this model is that threads can do most of
their work with local data and read access to the shared data, and
comparatively seldom require the global write lock to update the
shared data. Shared things do not get one lock each, since that
would cause excessive lock overhead.

Note that the shared data is typically the same as the data
referenced from the common environment (i.e. the "global data").
Also note that the current object (this) always is shared in pike
modules, so a thread cannot assume free access to it. In other pike
classes it would often be shared too, but it is still important to
utilize the situation when it is thread local. See issue "Function
calls".

A thread local thing, and all the things it references directly or
indirectly, automatically becomes shared whenever it gets referenced
from a shared thing. A shared thing never automatically becomes
thread local, but there is a function to explicitly "take" it. It
would first have to make sure there are no references to it from
shared or other thread local things (c.f. issue "Moving things
between lock spaces").

Thread.Queue has a special case so that if a thread local thing with
no other refs is enqueued, it is disowned by the current thread, and
later becomes thread local in the thread that dequeues it.
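The Thread.Queue hand-over special case can be sketched roughly as
follows. This is an illustrative C sketch with entirely made-up
struct and function names (single-slot queue, no synchronization of
the queue itself); the point is only the ownership transitions:
local -> disowned -> local in another thread.

```c
#include <assert.h>
#include <stddef.h>

struct thing {
  int refs;            /* total references to the thing         */
  void *owner;         /* owning thread, or NULL while disowned */
};

struct queue {
  struct thing *slot;  /* single-slot queue, for brevity */
};

/* Returns 1 if the thing could be handed over (sole reference),
   0 if it must go through the ordinary shared-data path. */
int enqueue_disowning(struct queue *q, struct thing *t)
{
  if (t->refs != 1) return 0;   /* other refs exist - can't disown */
  t->owner = NULL;              /* disowned: local to no thread    */
  q->slot = t;
  return 1;
}

struct thing *dequeue_adopting(struct queue *q, void *my_thread)
{
  struct thing *t = q->slot;
  q->slot = NULL;
  if (t) t->owner = my_thread;  /* becomes local to the dequeuer */
  return t;
}
```

A real implementation must of course also make the queue structure
itself safe for concurrent enqueue/dequeue.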
Issue: Lock spaces

Having a single global read/write lock for all shared data could
become a bottleneck. Thus there is a need for shared data with locks
separate from the global lock. The set of things that share a common
lock is called a "lock space", and it is always possible to look up
the lock that governs any given thing (see issue "Memory object
structure").

A special global lock space, which corresponds to the shared data
discussed above, is created on startup. All others have to be
created explicitly.

The intended use case for lock spaces is a "moderately large"
collection of things: Too large and you get outlocking problems, too
small and the lock overhead (both execution- and memorywise) gets
prohibitive. A typical lock space could be a RAM cache consisting of
a mapping and all its content.

Many different varieties of lock space locks can be considered, e.g.
a simple exclusive access mutex lock or a read/write lock, priority
locks, locks that ensure fairness, etc. Therefore different
(C-level) implementations should be allowed.

One important characteristic of lock space locks is whether they are
implicit or explicit:

Implicit locks are locked internally, without intervention on the
pike level. The lock duration is unspecified; locks are only
acquired to ensure internal consistency. All low level data access
functions check whether the lock space for the accessed thing is
locked already. If it isn't then the lock is acquired automatically.
All implicit locks have a well defined lock order (by pointer
comparison), and since they only are taken to guarantee internal
consistency, an access function can always release a lock to ensure
correct order (see also issue "Lock space locking").

Explicit locks are exposed to the pike level and must be locked in a
similar way to Thread.Mutex. If a low level data access function
encounters an explicit lock that isn't locked, it throws an error.
Thus it is left to the pike programmer to avoid deadlocks, but the
pike core won't cause any by itself. Since the pike core keeps track
of which lock governs which thing, it ensures that no lock violating
access occurs, which is a valuable aid to ensure correctness.

One can also consider a variant with a read/write lock space lock
that is implicit for read but explicit for write, thus combining
atomic pike-level updates with the convenience of implicit locking
for read access.

The scope of a lock space lock is (at least) the state inside all
the things it contains (with a couple of exceptions - see issue
"Lock space lock semantics"), but not the set of things itself, i.e.
things might be added to a lock space without holding a write lock.
Removing a thing from a lock space always requires the write lock on
it since that is necessary to ensure that a lock actually governs a
thing for as long as it is held (regardless of whether it's held for
reading or writing).

See also issues "Memory object structure" and "Lock space locking"
for more details.
Issue: Memory object structure

Of concern are the memory objects known to the gc. They are called
"things", to avoid confusion with "objects" which are the structs
for pike objects.

There are two types of things:

o First class things with gc header and lock space pointer. Most
  pike visible types are first class things. The exceptions are ints
  and floats, which are passed by value.

o Second class things contain only a gc header. They are similar to
  first class except that their lock spaces are implicit from the
  referencing things, which means all those referencing things must
  always be in the same lock space.

Thread local things could have NULL as lock space pointer, but as a
debug measure they could also point to the thread object so that
it's possible to detect bugs with a thread accessing things local to
another thread.

Before the multi-cpu architecture, there are global double-linked
lists for each referenced pike type: array, mapping, multiset,
object, and program (strings and types are handled differently).
Thanks to the new gc, the double-linked lists aren't needed at all
anymore.

      +----------+                 +----------+
      | Thread 1 |                 | Thread 2 |
     .+----------+.               .+----------+.
     :  O <--> O  :               :  O      O  :
     :  O      O  :               :  O      O  :
     :..|......|..:               :..|......|..:
        | ref  | ref                 | ref  | ref
   .....v......v.   ..............   ..v......v......
   :  O      O  :   :  O   O   O :   :  O      O    :
   :  O <---> O ------> O   O <-------->  O  O   O  :
   :  O   O  O  :   :  O  O   O  :   :  O   O  O    :
  +--------------+ +--------------+ +---------------+
  | Lock space 1 | | Lock space 2 | | Lock space 3  |
  +--------------+ +--------------+ +---------------+

This figure tries to show some threads and lock spaces, and their
associated things as O's inside the dotted areas. Some examples of
possible references between things are included: Thread local things
can only reference things belonging to the same thread or things in
any lock space, while things in lock spaces can reference things in
the same or other lock spaces. There can be cyclic structures that
span lock spaces.

The lock space lock structs are tracked by the gc just like anything
else, and they are therefore garbage collected when they become
empty and unreferenced. The gc won't free a lock space lock struct
that is locked since it always has at least one reference from the
array of locked locks that each thread maintains (c.f. issue "Lock
space locking").
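A minimal C sketch of a first class thing follows. All struct and
field names here are invented for illustration - the actual layout
is part of this design's open questions - but it shows the essential
point: the gc header plus a lock space pointer, where NULL marks a
thread local thing.

```c
#include <assert.h>
#include <stddef.h>

struct gc_header {
  unsigned refs;                  /* maintained by the gc thread */
  unsigned flags;
};

struct lock_space;                /* opaque; holds the actual lock */

struct first_class_thing {
  struct gc_header gc;
  struct lock_space *lock_space;  /* NULL => thread local */
  /* ... type specific data follows ... */
};

/* A NULL lock space pointer means the thing is thread local and
   may be accessed without any locking at all. */
int is_thread_local(const struct first_class_thing *t)
{
  return t->lock_space == NULL;
}
```

A second class thing would consist of the gc header alone, with the
lock space given by whatever first class things reference it.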
Issue: Lock space lock semantics

There are three types of locks:

o A read-safe lock ensures only that the data is consistent, not
  that it stays constant. This allows lock-free updates in things
  where possible (which could include arrays, mappings, and maybe
  even multisets and objects of selected classes).

o A read-constant lock ensures both consistency and constantness
  (i.e. what usually is assumed for a read-only lock).

o A write lock ensures complete exclusive access. The owning thread
  can modify the data, and it can assume no other changes occur to
  it (barring refcounters and lock space pointers - see below),
  although that assumption has to be "weak" since there are a few
  situations when another thread can intervene - see issue
  "Emulating the interpreter lock".

  The owning thread can also, for a limited time, leave the data in
  an inconsistent state. This is however still limited by the calls
  to check_threads(), which means that the state must be consistent
  again every time the evaluator callbacks are run. The reason is
  the same one as above.

Allowing lock-free updates is attractive, so the standard read/write
lock that governs the global lock space will probably be multiple
read-safe/single write.

The lock space lock covers all the data in the thing, with two
exceptions:

o The refcounter (and other gc-related flags and fields) can always
  change concurrently since the gc runs in a thread of its own, and
  it doesn't heed any locks - see issue "Garbage collector".

  A ref to a thing can always be added or removed, even if another
  thread holds an exclusive write lock on it. That is because the
  thing will only be freed by the gc, which won't free it if a ref
  is added.

  Refcount updates need to be atomic if the refcounts are to be used
  at all from other threads. Even so, they can only be used
  opportunistically since they (almost) always might change
  asynchronously. That could still be good enough for e.g.
  Pike.count_memory (no one could expect it to be accurate anyway if
  another thread is modifying the data structure being measured).

o The lock space pointer itself must at all times be either NULL or
  point to a valid lock space struct, since another thread needs to
  access it to tell whether access to the thing is permissible. A
  write lock is required to change the lock space pointer, but even
  so the update must be atomic. Since the lock space lock structs
  are collected by the gc, there is no risk of races when threads
  asynchronously dereference lock space pointers.
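The two exceptions above could be expressed with C11 atomics roughly
as in this sketch. It is illustrative only (invented names, no
connection to the actual Pike structs): the refcount is adjusted with
atomic read-modify-write operations by any thread, and the lock space
pointer is replaced with a single atomic store so concurrent readers
never observe a torn pointer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct lock_space;

struct thing {
  _Atomic unsigned refs;
  _Atomic(struct lock_space *) lock_space;
};

/* Any thread may add or remove a ref at any time; only the gc
   thread ever frees the thing. */
void add_ref(struct thing *t) { atomic_fetch_add(&t->refs, 1); }
void sub_ref(struct thing *t) { atomic_fetch_sub(&t->refs, 1); }

/* Caller must hold the write lock on the thing, but the store
   itself must still be atomic: the pointer always reads as either
   NULL or a valid lock space. */
void set_lock_space(struct thing *t, struct lock_space *ls)
{
  atomic_store(&t->lock_space, ls);
}
```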
FIXME: What about concurrent gc access to follow pointers?
Issue: Lock space locking

This is the locking procedure to access a thing:

1. Read the lock space pointer. If it's NULL then the thing is
   thread local and nothing more needs to be done.

2. Address an array containing the pointers to the lock spaces that
   are already locked by the thread.

3. Search for the lock space pointer in the array. If present then
   nothing more needs to be done.

4. Lock the lock space lock as appropriate. Note that this can imply
   that other implicit locks that are held are unlocked to ensure
   correct lock order (see issue "Lock spaces"). Then it's added to
   the array.

A thread typically won't hold more than a few locks at any time
(fewer than ten or so), so a plain array and linear search should
perform well. For quickest possible access the array should be a
static thread local variable (c.f. issue "Thread local storage").

If the array gets full, implicit locks in it can be released
automatically to make space. Still, a system where more arrays can
be allocated and chained on would perhaps be prudent to avoid the
theoretical possibility of running out of space for locked locks.

Since implicit locks can be released (almost) at will, they are open
for performance tuning: Too long lock durations and they'll outlock
other threads, too short and the locking overhead becomes more
significant. As a starting point, it seems reasonable to release
them at every evaluator callback call (i.e. at approximately every
pike function call and return).
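Steps 1-4 above can be sketched in C like this. The names are
invented and the lock-order handling, implicit-lock release and
chained overflow arrays are left out; the sketch only shows the
NULL fast path and the linear search through the per-thread array of
held locks.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HELD 10

struct lock_space { int dummy; };

/* Would be a static thread local variable in the real thing. */
struct thread_state {
  struct lock_space *held[MAX_HELD];
  int nheld;
};

static void lock_space_lock(struct lock_space *ls) { (void)ls; /* ... */ }

/* Returns nonzero when the thing may be accessed. */
int lock_for_access(struct thread_state *ts, struct lock_space *ls)
{
  int i;
  if (!ls) return 1;                    /* 1: thread local - no locking */
  for (i = 0; i < ts->nheld; i++)       /* 2-3: already locked by us?   */
    if (ts->held[i] == ls) return 1;
  if (ts->nheld == MAX_HELD) return 0;  /* would release implicit locks
                                           here to make space */
  lock_space_lock(ls);                  /* 4: take the lock ...         */
  ts->held[ts->nheld++] = ls;           /*    ... and remember it       */
  return 1;
}
```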
Issue: Garbage collector

Pike has used refcounting to collect noncyclic structures, combined
with a stop-the-world periodical collector for cyclic structures.
The periodic pauses are already a problem, and they only get worse
as the heap size and number of concurrent threads increase. Since
the gc needs an overhaul anyway, it makes sense to replace it with a
more modern solution.

http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-get.cgi/2006/PHD/PHD-2006-10.ps
is a recent thesis work that combines several state-of-the-art gc
algorithms into an efficient whole. A brief overview of the
highlights:

o The reference counts aren't updated for references on the stack.
  The stacks are scanned when the gc runs instead. This saves a
  great deal of refcount updates, and it also simplifies C level
  programming a lot. Only refcounts between things on the heap are
  counted.

o The refcounts are only updated when the gc runs. This saves a lot
  of the remaining updates since if a pointer starts with value p_0
  and then changes to p_1, then p_2, p_3, ..., and lastly to p_n at
  the next gc, then only p_0->refs needs to be decremented and
  p_n->refs needs to be incremented - the changes in all the other
  refcounts, for the things pointed to in between, cancel out.

o The above is accomplished by thread local logging, to make the old
  p_0 value available to the gc at the next run. This means it
  scales well with many cpus.

o A generational gc uses refcounting only for old things in the
  heap. New things, which are typically very short-lived, aren't
  refcounted at all but instead gc'ed using a mark-and-sweep
  collector. This is shown to be more efficient for short-lived
  data, and it handles cyclic structures without any extra effort.

o By using refcounting on old data, the gc only needs to give
  attention to refcounts that get down to zero. This means the heap
  can scale to any size without affecting the gc run time, as
  opposed to using a mark-and-sweep collector on the whole heap.
  Thus the gc time scales only with the amount of _change_ in the
  heap.

o Cyclic structures in the old refcounted data are handled
  incrementally using the fact that a cyclic structure can only
  occur when a refcounter is decremented to a value greater than
  zero. Those things can therefore be tracked and cycle checked in
  the background. The gc uses several different methods to weed out
  false alarms before doing actual cycle checks.

o The gc runs entirely in its own thread. It only needs to stop the
  working threads for a very short time to scan stacks etc, and they
  can be stopped one at a time.

Effects of using this in Pike:

a. References from the C or pike stacks don't need any handling at
   all (see also issue "Garbage collection and external
   references").

b. A significant complication in various lock-free algorithms is the
   safe freeing of old blocks (see e.g. issue "Lock-free hash
   table"). This gc would solve almost all such problems in a
   convenient way.

c. Special code is used to update refs in the heap. During certain
   circumstances, before changing a pointer inside a thing which can
   point to another thing, the state of all non-NULL pointers in it
   is copied to a thread local log. This is mostly problematic since
   it requires that every pointer assignment inside a thing is
   replaced with a macro or function call, which has a big impact on
   C code. See issue "C module interface".

d. A new log_pointer field is required per thing. If a state copy
   has taken place as described above, it points to the log that
   contains the original pointer state of the thing. Data containers
   that can be of arbitrary size (i.e. arrays, mappings and
   multisets) should be segmented into fixed-sized chunks with one
   log_pointer each, so that the state copy doesn't get arbitrarily
   large.

e. The double-linked lists aren't needed. Hence two pointers less
   per thing.

f. The refcounter word is changed to hold both normal refcount, weak
   count, and flags. Overflowed counts are stored in a separate hash
   table.

g. The collector typically runs concurrently with the rest of the
   program, but there are some situations when it has to synchronize
   with the other threads (aka handshake). In the research paper
   this is done by letting the gc thread suspend and resume the
   other threads (one at a time). Since preemptive suspend and
   resume operations are generally unsupported in thread libraries
   (c.f. issue "Preemptive thread suspension"), a cooperative
   approach is necessary: The gc thread sets a state flag signalling
   that all other threads need a handshake. Threads that are running
   do the handshake work themselves before waiting on a mutex or in
   the next evaluator callback call, and the gc thread handles the
   threads that are currently waiting (ensuring that they don't
   start in the meantime).

   The work that needs to be done during a handshake is to set some
   flags and record some local thread state for use by the gc
   thread. This can be done concurrently in several threads, so no
   locking is necessary.

   Due to this interaction with the other threads, it's vital that
   the gc thread does not hold any mutex, and that it takes care to
   avoid being stopped (e.g. through an interrupt) while it works on
   behalf of another thread.
h. All garbage, both noncyclic and cyclic, is discovered and handled
   by the gc thread. The other threads never free any block known to
   the gc.

i. An effect of the above is that all garbage is discovered by a
   separate collector thread which doesn't execute any other pike
   code. This opens up the issue of how to call destruct functions.
   At least thread local things should reasonably get their destruct
   calls in that thread. A problem is however what to do when that
   thread has exited or emigrated (see issue "Foreign thread
   visits").

   For shared things it's not clear which thread should call
   destruct anyway, so in that case any thread could do it. It might
   however be a good idea not to do it directly in the gc thread,
   since doing so would require that thread too to be a proper pike
   thread with pike stack etc; it seems better to keep it an
   "invisible" low-level thread outside the "worker" threads. In
   programs with a "backend thread" it could be useful to let the gc
   thread wake up the backend thread to execute the destruct calls.

j. The most bothersome problem is that things are no longer freed
   right away when running out of refs. See issue "Immediate
   destruct/free when refcount reaches zero".

k. Weak refs are handled with a separate refcount in each thing.
   That means things have two refcounts: One for weak refs and
   another for all refs. See also issue "Weak ref garbage
   collection".

l. One might consider separating the refcounts from the things by
   using a hash table. This makes sense when considering that only
   the collector thread is using the refcounts, thereby avoiding
   false aliasing occurring from refcounter updates (and other gc
   related flags) by that thread. All the hash table lookups would
   however incur a significant overhead in the gc thread. A better
   alternative would be to use a bitmap based on the possible
   allocation slots used by the malloc implementation, but that
   would require very tight integration with the malloc system. The
   bitmap could work with only two bits per refcounter - research
   shows that most objects in a refcounted heap have very few refs.
   Refcounters that overflow (a.k.a. get "stuck") at 3 would then be
   stored in a hash table.

To simplify memory handling, the gc should be used consistently on
all heap structs, regardless of whether they are pike visible things
or not.

An interesting question is whether the type info for every struct
(more concretely, the address of some area where the gc can find the
functions it needs to handle the struct) is carried in the struct
itself (through a new pointer field), or if it continues to be
carried in the context for every pointer to the struct (e.g. in the
type field in svalues). Since the gc would be used for most internal
structs as well, which are almost exclusively used via compile-time
typed pointers, it would probably save significant heap space to
retain the type in the pointer context. It does otoh complicate the
gc - everywhere the gc is fed a pointer to a thing, it must also be
fed a type info pointer, and the gc must then keep track of this
data tuple internally.


Issue: Immediate destruct/free when refcount reaches zero

When a thing in Pike runs out of references, it's destructed and
freed almost immediately in the pre-multi-cpu implementation. This
behavior is used implicitly in many places.
The major (hopefully all) principal use cases of concern are:

1. It's popular to make code that releases a lock in a timely manner
   by just storing it in a local variable that gets freed when the
   function exits (either by normal return or by exception). E.g:

     void foo()
     {
       Thread.MutexKey my_lock = my_mutex->lock();
       ... do some work ...
       // my_lock falls out of scope here when the function exits
       // (also if it's due to a thrown exception), so the lock is
       // released right away.
     }

   There's also code that opens files and sockets etc, and expects
   them to be automatically closed again through this method. (That
   practice has been shown to be bug prone, though, so in the
   sources at Roxen many of those places have been fixed over time.)

2. In some cases, structures are carefully kept acyclic to make them
   get freed quickly, and there is no control over which party gets
   the "last reference". One example is if a cache holds one ref to
   an entry, and there might at the same time be one or more worker
   threads that hold references to the same entry while they use it.
   In this case the cache can be pruned safely by dropping the
   reference to the entry, without destructing it.

   A variant when the structure cannot be made acyclic is to make a
   "wrapper object": It holds a reference to the cyclic structure,
   and all other parties make sure to hold a ref to the wrapper as
   long as they have an interest in any part of the data. When the
   wrapper runs out of refs, it destructs the cyclic structure
   explicitly.

   These tricks have mostly been used to reduce the amount of cyclic
   garbage that would require the stop-the-world gc to run more
   often, but there are also occasions when the structure holds open
   fd's which must be closed without delay (one such occasion is the
   connection fd in the http protocol in the Roxen WebServer).

3. In some applications with extremely high data mutation rate, the
   immediate freeing of acyclic structures is seen as a prerequisite
   to keep bounds on memory consumption.

4. FIXME: Are there more?

The proposed gc (c.f. issue "Garbage collector") does not retain the
immediate destruct and free semantic - only the gc running in its
own thread may free things. Although it would run much more often
than the old gc (probably on the order of once a minute up to
several times a second), it would still break this semantic. To
discuss each use case above:

1. Locks, and in some cases also open fd's, cannot wait until the
   next gc run. Observing that mutex locks always are thread local
   things, almost all these cases (exceptions are possibly fd
   objects that somehow are shared anyway) can be solved by a
   modified gc approach - see issue "Micro-gc".

   Since the micro-gc approach appears to be expensive, it's worth
   considering to actually ditch this behavior and solve the problem
   on the pike level instead. The compiler can be used to detect
   many of these cases by looking for assignments to local variables
   that aren't accessed from anywhere (there is already such a
   warning, but it has been tuned down just to allow this
   problematic idiom). A new language construct would be necessary,
   to ensure that the variable gets destructed both on normal
   function exit and when an exception is thrown. It could look
   something like this:

     void foo()
     {
       destruct_on_exit (Thread.MutexKey my_lock = my_mutex->lock()) {
         ... do some work which requires the lock ...
       }
     }

   I.e. the destruct_on_exit clause ensures that the variable(s) in
   the parentheses are destructed (regardless of the amount of refs)
   if execution passes out of the block in any way.

   Anyway, since implementing the micro-gc is a comparatively small
   amount of extra work, the intention is to do that first, and then
   later implement the full gc as an experimental mode so that
   performance can be compared.

2. This is not a problem as long as the reason only is gc
   efficiency. It's worth noting that tricks such as "wrapper
   objects" still have some use since they lessen the load on the
   background cycle detector.

   It is however a problem if there are open fd's or similar things
   in the structure. It doesn't look like this is feasible to solve
   internally; such structures typically are shared data, and
   letting different threads reference shared data without locking
   is essential for multi-cpu performance. This is therefore a case
   that is probably best solved on the pike level instead, possibly
   through pike-visible refcounting. These cases appear to be fairly
   few, at least.

3. If the solution in the issue "Micro-gc" is implemented, this
   problem hardly exists at all since thread local data is
   refcounted and freed almost exactly the same way as before.

   Otherwise, since the gc thread operates only on the new and
   changed data, and collects newly allocated data very efficiently,
   it would keep up with a very high mutation rate. GC runs are
   scheduled to run just often enough to keep the heap size within a
   set limit - as long as the gc thread doesn't become saturated and
   run continuously, it offloads the refcounting and freeing
   overhead from the worker threads completely.

   If the data mutation rate is so high that the gc thread becomes
   saturated, what would happen is that malloc calls would start to
   block when the heap limit is reached. Research shows that a
   periodic gc done right provides considerably more throughput than
   pure refcounting, so the application would still run faster
   including that blocking. The remaining concern is then that the
   blocking would introduce uneven response times - the worker
   threads would go very fast most of the time but every once in a
   while they could hang waiting on the gc thread. These hangs are
   (according to the research paper) on the order of milliseconds,
   but if they still are problematic then a crude solution would be
   to introduce artificial short sleeps in the working threads to
   bring down the mutation rate - even with those sleeps the
   application would probably still be significantly faster than the
   current approach.
Issue: Micro-gc A way to retain the immediate-destruct (and free) semantic for thread local things referenced only from the pike stack is to implement a "micro-gc" that runs very quickly and is called often enough to keep the semantic. To begin with, the mark-and-sweep gc for new data (as discussed in the issue "Garbage collector") is not implemented, and the refcounts for thread local things are not delay-updated at all. The work of the micro-gc then becomes to free all things in the zero-count table (ZCT) that aren't referenced from the thread's C and pike stacks. Scanning the two stacks completely in every micro-gc would be too expensive. That is solved by partitioning the ZCT so that every pike stack frame gets one of its own. New zero-count things are always put in the ZCT for current topmost frame. That way, the micro-gc can scan the topmost parts of the stacks (above the last pike stack frame) for references to things in the topmost ZCT, and when a pike stack frame is popped then the things in its ZCT can be freed without scanning at all. This is enough to timely destruct and free the things put on the pike stack. Furthermore, since the old immediate-destruct semantics only requires destructing before and after every pike level function call, it won't be necessary for the micro-gc to scan the C stack at all (there's never any part of it above the current frame, i.e. above the innermost mega_apply, to scan). Note that the above works under the assumption that new things are only referenced from the stacks in or below the current frame. That's not always true - code might change the stack further back to reference new things, e.g. if a function allocates some temporary struct on the stack and then pass the pointer to it to subroutines that change it. Such code on the C level is very unlikely, since it would mean that C code would be changing something on the C stack back across a pike level apply. 
On the Pike level it can occur with inner functions changing variables in their surrounding functions. Those cases can however be detected and handled one way or another. One way is to detect them at compile time and "stay" in the frame of the outermost surrounding function for the purposes of the micro-gc. That doesn't scale well if the inner functions are deeply recursive, though.

This micro-gc approach comes at a considerable expense compared to the solution described in the issue "Garbage collector": Not only does the generational gc with mark-and-sweep for young data disappear (which according to the research paper gives 15-40% more total throughput), but the delayed updating of the refcounts disappears to a large extent too. Refcounting from the stacks is still avoided though, and delayed updating of refcounts in shared data is still done, which is crucial for multi-cpu performance.

Issue: Single-refcount optimizations

Pre-multi-cpu Pike makes use of the refcounting to optimize operations: Some operations that shouldn't be destructive on their operands can be destructive anyway on an operand if it has no other references. A common case is adding elements to arrays:

  array arr = ({});
  while (...)
    arr += ({another_element});

Here arr has only a single reference, from the stack, so the += operator destructively grows the array to add new elements to the end of it. With the new gc approach, such single-refcount optimizations no longer work in general. This is the case even if the micro-gc is implemented, since stack refs aren't counted.

FIXME: List cases and discuss solutions.
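As a rough C-level sketch of the single-refcount optimization (the struct layout and function names are invented for illustration, not Pike's real internals):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for a refcounted array thing. */
struct array { int refs; int size; int *items; };

/* Append: destructive when we hold the only reference, copying otherwise. */
struct array *array_append(struct array *a, int value)
{
    if (a->refs == 1) {
        /* Sole owner: grow in place, which is what pre-multi-cpu Pike's
         * += does when the refcount is 1. */
        a->items = realloc(a->items, (a->size + 1) * sizeof(int));
        a->items[a->size++] = value;
        return a;
    }
    /* Shared: leave the original untouched and return a fresh copy. */
    struct array *b = malloc(sizeof *b);
    b->refs = 1;
    b->size = a->size + 1;
    b->items = malloc(b->size * sizeof(int));
    memcpy(b->items, a->items, a->size * sizeof(int));
    b->items[a->size] = value;
    a->refs--;    /* the caller's ref moves to the copy */
    return b;
}
```

The problem described in the text is exactly that the `refs == 1` test stops being meaningful once stack references no longer bump the refcount.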
Issue: Weak ref garbage collection

When the two refcounters (one for the total number of refs and another for the number of weak refs) are equal then the thing is semantically freed. The problem is that it still has refs which might be followed later, so the gc cannot free it. There are two ways to tackle this problem:

One alternative is to keep track of all the weak pointers that point to each thing, so that they can be followed backwards and cleared when only weak pointers are left. That tracking requires additional data structures and the associated overhead, and clearing the other pointers might require lock space locks to be taken.

Another alternative is to free all refs emanating from the thing with only weak pointers left, and keep it as an empty structure (a destructed object, an empty array/multiset/mapping, or an empty skeleton program which contains no identifiers). This approach requires a flag to recognize such semi-freed things, and that all code that dereferences weak pointers checks for it. A problem is that data blocks remain allocated longer than necessary, maybe even indefinitely. That can be mitigated to some degree by shortening them using realloc(3).
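The two-counter test and the semi-free flag from the second alternative can be sketched like this (field and function names are invented for illustration):

```c
#include <assert.h>

/* refs counts all references, weak_refs counts the weak subset. */
struct thing { int refs; int weak_refs; int semi_freed; };

/* A thing is semantically freed once only weak pointers remain. */
int only_weak_refs_left(const struct thing *t)
{
    return t->refs != 0 && t->refs == t->weak_refs;
}

/* Second alternative from the text: mark the thing as an empty shell so
 * that code dereferencing weak pointers can check the flag. A real
 * implementation would also free all refs emanating from the thing. */
void semi_free(struct thing *t)
{
    t->semi_freed = 1;
}
```

Every weak-pointer dereference then becomes "load pointer, check `semi_freed`, treat as destructed if set".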
Issue: Moving things between lock spaces

Things can be moved between lock spaces, or be made thread local or disowned. In all these cases, one or more things are given explicitly. It's natural that not only those things are moved, but also all other things in the same source lock space that are referenced from the given things and not from anywhere else (this operation is the same traversal as Pike.count_memory does). In the case of making things thread local or disowned, it is also necessary to check that the explicitly given things aren't referenced from elsewhere.
FIXME: This is a problem with the proposed garbage collector (see issue "Garbage collector"). Old things have refcounts that can be used, but they might be stale, and the logging doesn't provide information in the form we need. New things are even worse since they have no refcounts at all that can be used to check for outside refs. Furthermore, there is a race since an external ref can be added at any time from any thread.
All this is settled when the gc is run: If the "controlled" refs are temporarily ignored then the set to move is the one that would turn into garbage. But it is not good to have to either wait for the gc or run it synchronously.
Also, the problem above applies to Pike.count_memory too.
Issue: Strings

Strings are unique in Pike. This property is hard to keep if threads have local string pools, since a thread local string might become shared at any moment, and thus would need to be moved. Therefore the string hash table remains global, and lock congestion is avoided with some concurrent access hash table implementation. See issue "Lock-free hash table". Lock-free is a good start, but the hash function must also provide a good even distribution to avoid hotspots. Pike currently uses an in-house algorithm (DO_HASHMEM in pike_memory.h). Replacing it with a more widespread and better studied alternative should be considered. There seem to be few that are below O(n) (which DO_HASHMEM is), though.

Issue: Types

Like strings, types are globally unique and always shared in Pike. That means lock-free access to them is desirable, and it should also
be doable fairly easily since they are constant. Otoh it's probably not as vital as for strings since types are typically only built during compilation.
Issue: Mapping and multiset data blocks

Mappings and multisets currently have a deferred copy-on-write behavior, i.e. several mappings/multisets can share the same data block and it's only copied to a local one when changed through a specific mapping/multiset. If mappings and/or multisets are changed to be lock-free then the copy-on-write behavior needs to be solved:

o A flag is added to the mapping/multiset data block that is set whenever it is shared.

o Every destructive operation checks the flag. If set, it makes a copy, otherwise it changes the original block. Thus the flag is essentially a read-only marker.
o In addition to the flag, the gc performs normal refcounting. It clears the flag if the refcount is 1. (The refcount cannot be used directly since it's delay-updated.)
o Hazard pointers are necessary for every destructive access, including the setting of the flag. The reason is that the read-onlyness is only in effect after all currently modifying threads are finished with the block. The thread that is setting the flag therefore has to wait until there are no other hazard pointers to the block before returning.

It's a good question whether keeping the copy-on-write feature is worth this overhead. Of course, an alternative is to simply let the builtin mappings and/or multisets be locking, and instead have special objects that implement lock-free data types.

Another issue is whether things like mapping/multiset data blocks should be first or second class things (c.f. issue "Memory object structure"). If they're second class it means copy-on-write behavior doesn't work across lock spaces. If they're first class it means additional overhead handling the lock spaces of the mapping data blocks, and if a mapping data is shared between lock spaces then it has to be in some third lock space of its own, or in the global lock space, neither of
which would be very good. So it doesn't look like there's a better way than to botch copy-on-write in this case.
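The flag-check-then-copy scheme from the bullets above can be sketched as follows. This is a minimal single-threaded illustration with invented names; the hazard pointer waiting that the text requires is deliberately left out:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for a mapping and its shared data block. */
struct mapping_data { int shared_flag; int size; int vals[4]; };
struct mapping { struct mapping_data *data; };

/* Sharing a data block between two mappings sets the read-only marker.
 * Real code must also wait out hazard pointers before relying on it. */
void mapping_share(struct mapping *dst, struct mapping *src)
{
    src->data->shared_flag = 1;
    dst->data = src->data;
}

/* Every destructive operation checks the flag and copies if it is set,
 * otherwise it changes the original block in place. */
void mapping_set(struct mapping *m, int i, int v)
{
    if (m->data->shared_flag) {
        struct mapping_data *copy = malloc(sizeof *copy);
        memcpy(copy, m->data, sizeof *copy);
        copy->shared_flag = 0;    /* private again */
        m->data = copy;
    }
    m->data->vals[i] = v;
}
```

The gc's refcount-based clearing of the flag would slot in as a separate step that resets `shared_flag` when only one mapping points at the block.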
Issue: Emulating the interpreter lock

For compatibility with old C modules, and for the _disable_threads function, it is necessary to retain a complete lock like the current interpreter lock. It has to lock the global area for writing, and also stop all access to all lock spaces, since the thread local data might refer to any lock space. This lock is implemented as a read/write lock, which normally is held permanently for reading by all threads. Only when a thread is waiting to acquire the compat interpreter lock is it released as each thread goes into check_threads(). This lock cannot wait for explicit lock space locks to be released. Thus it can override the assumption that a lock space is safe from tampering by holding a write lock on it. Still, it's only available from the C level (with the exception of _disable_threads) so the situation is not any different from the way the interpreter lock overrides Thread.Mutex today.

Issue: Function calls

A lock on an object is almost always necessary before calling a function in it. Therefore the central apply function (mega_apply) must ensure an appropriate lock is taken. Which kind of lock (read-safe/read-constant/write - see issue "Lock space lock semantics") depends on what the function wants to do, so all object functions are extended with flags for this. The best default is probably read-safe. Flags for no locking (for the few special cases where the implementations actually are completely lock-free) and for compat-interpreter-lock-locking should probably exist as well. A compat-interpreter-lock flag is also necessary for global functions that don't have a "this" object (aka efuns). Having the required locking declared this way also alleviates each function from the burden of doing the locking to access the current storage, and it allows future compiler optimizations to minimize lock operations.
Issue: Exceptions "Forgotten" locks after exceptions shouldn't be a problem: Explicit locks are handled just like today (i.e. it's up to the pike programmer), and implicit locks can safely be released when an exception is thrown. One case requires attention: An old-style function that requires the compat interpreter lock might catch an error. In that case the error
system has to ensure that the lock is reacquired.
Issue: C module interface

A new add_function variant will probably be added for new-style functions. It takes bits for the flags discussed in issue "Function calls". New-style functions can only assume free access to the current storage according to those flags; everything else must be locked (through a new set of macros/functions). Accessor functions for data types (e.g. add_shared_strings, mapping_lookup, and object_index_no_free) handle the necessary locking internally. They will only assume that the thing is safe, i.e. that the caller ensures the current thread controls at least one ref. THREADS_ALLOW/THREADS_DISALLOW and their likes are not used in new-style functions. There will be new GC callbacks for walking module global pointers to things (see issue "Garbage collection and external references").
The proposed gc requires that every pointer change in a (heap allocated) thing is tracked (for pointers that might point to other heap allocated things). This is because the gc has to log the old state of the pointers before the first change after a gc run (see issue "Garbage collector", item c). For all builtin data types, this is handled internally in primitives like mapping_insert and object_set_index, so the only cases that the C module code typically has to handle are direct updates in the current storage. Therefore all pointer changes that currently look something like

  THIS->my_thing = some_thing;

must be wrapped in some kind of macro/function call to become:

  set_ptr (THIS, my_thing, some_thing);

On the positive side, all the refcount twiddling to account for references from the C and pike stacks can be removed from the C code. That also includes a lot of the SET_ONERROR stuff which currently is necessary to avoid lost refs when errors are thrown.
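One possible shape for such a set_ptr wrapper is sketched below. Everything here is an assumption for illustration: the header layout, the per-thing generation counter, and gc_log_old_ptr are invented stand-ins for whatever logging the real gc would do:

```c
#include <assert.h>
#include <stddef.h>

/* Invented per-thing gc bookkeeping. */
struct thing_header { int logged_generation; };
static int gc_generation = 1;
static void *last_logged_old_value;

/* Stand-in for appending the old pointer value to the gc's change log. */
static void gc_log_old_ptr(void *old_value)
{
    last_logged_old_value = old_value;
}

/* Write barrier: log the old pointer state before the first change of
 * the thing after a gc run; later changes in the same cycle skip it. */
#define set_ptr(thing, field, value) do {                       \
        if ((thing)->hdr.logged_generation != gc_generation) {  \
            gc_log_old_ptr((thing)->field);                     \
            (thing)->hdr.logged_generation = gc_generation;     \
        }                                                       \
        (thing)->field = (value);                               \
    } while (0)

struct my_thing { struct thing_header hdr; void *my_ptr; };
```

A real barrier would likely need to be per-pointer rather than per-thing, but the skeleton shows why a plain assignment cannot remain plain.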
Issue: C module compatibility
Currently it doesn't look like the goal to keep a source-level compatibility mode for C modules can be achieved. The problem is that every pointer assignment in every heap allocated thing must be wrapped inside a macro/function call to make the new gc work (see issue "Garbage collector", item c), and lots of C module code changes such pointers directly through plain assignments.
Ref issue "Emulating the interpreter lock". Ref issue "Garbage collection and external references".

Issue: Garbage collection and external references

The current gc design is that there is an initial "check" pass that determines external references by counting all internal references and then, for each thing, subtracting that count from its refcount. If the result isn't zero then there are external references (e.g. from global C variables or from the C stack) and the thing is not garbage.
The new gc (c.f. issue "Garbage collector") does not refcount external refs and refs from the C or Pike stacks. It needs to find them some other way:
References from global C variables are few, so they can be dealt with by requiring C modules and the core parts to provide callbacks that let the gc walk through them (see issue "C module interface"). This is however not compatible with old C modules.

References from C stacks are common, and it is infeasible to require callbacks that keep track of them. The gc instead has to scan the C stacks for the threads and treat any aligned machine word containing
an apparently valid pointer to a gc candidate thing as an external reference. This is the common approach used by standalone gc libraries that don't require application support. For reference, here is one such garbage collector, written in C++:
d60f152009-04-19Martin Stjernholm http://developer.apple.com/DOCUMENTATION/Cocoa/Conceptual/GarbageCollection/Introduction.html#//apple_ref/doc/uid/TP40002427 Its source is here: http://www.opensource.apple.com/darwinsource/10.5.5/autozone-77.1/
The same approach would also be necessary to cope with old C modules (see issue "C module compatibility"), but since global C level pointers are few, it might not be mandatory to get this working. And besides, it appears unlikely that compatibility with old C modules can be kept.
Issue: Global pike level caches
Global caches that are shared between threads are common, and in almost all cases such caches are implemented using mappings. There's therefore a need for (at least) a hash table data type that handles concurrent access and high mutation rates very efficiently. Issue "Lock-free hash table" discusses such a solution. It's currently not clear whether the builtin mappings will be lock-free or not (c.f. the copy-on-write problem in issue "Mapping and multiset data blocks"), but if they're not then a mapping-like object class that is lock-free will be implemented. It's easy to replace global cache mappings with such objects.
Issue: Thread.Queue

A lock-free implementation should be used. The things in the queue are typically disowned to allow them to become thread local in the reading thread.
Issue: "Relying on the interpreter lock"
FIXME
Issue: False sharing

False sharing occurs when thread local things used frequently by different threads are next to each other so that they share the same cache line. Thus the cpu caches might force frequent resynchronization of the cache line even though there is no apparent hotspot problem on the C level. This can be a problem in particular for all the block_alloc pools containing small structs. Using thread local pools is seldom a workable solution since most thread local structs might become shared later on.

One way to avoid it is to add padding (and alignment). Cache line sizes are usually 64 bytes or less (at least for Intel ia32). That should be small enough to make this viable in many cases. FIXME: Check cache line sizes on the other important architectures.

Another way is to move things when they get shared, but that is pretty complicated and slow.

Issue: Malloc and block_alloc

Standard OS mallocs are usually locking. Bundling a lock-free one could be important. FIXME: Survey free implementations.

Block_alloc is a simple homebrew memory manager used in several different places to allocate fixed-size blocks. The block_alloc pools are often shared, so they must allow efficient concurrent access. With a modern malloc, it is possible that the need for block_alloc is gone, or perhaps the malloc lib has builtin support for fixed-size pools. Making a lock-free implementation is nontrivial, so the homebrew ought to be ditched in any case. A problem with ditching block_alloc is that there is some code that walks through all allocated blocks in a pool, and also avoids garbage by freeing the whole pool altogether. FIXME: Investigate alternatives here.

See also issue "False sharing".
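The padding approach from the false sharing issue above can be sketched as follows, assuming the 64-byte cache line size mentioned in the text:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache line size; would be probed or configured per platform. */
#define CACHE_LINE 64

/* Pad each frequently updated struct to a full cache line so that two
 * adjacent instances never share a line. */
struct counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

/* E.g. one hot counter per thread, now guaranteed to sit on separate
 * cache lines even when allocated contiguously. */
struct counter per_thread[4];
```

The cost is memory: every instance occupies a whole line, which is why the text notes this is only viable when the structs are hot enough to matter.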
Issue: Heap size control

There should be better tools to control the heap size. It should be possible to set the wanted heap size so that the gc runs timely before that limit is reached. Pike should detect the available amount of real memory (i.e. not counting swap) to use as default. The gc should still use a garbage projection strategy to keep the process below the configured maximum size for as long as possible. This is more important if the gc is used also for previously refcounted garbage (c.f. issue "Garbage collector"). Malloc calls should be wrapped to allow the gc to run in blocking mode in case they fail.
Issue: The compiler
FIXME
Issue: Foreign thread visits
FIXME. JVM threads..
Issue: Pike security system

It is possible that keeping the pike security system intact would complicate the implementation, and even if it was kept intact a lot of testing would be required before one can be confident that it really works (and there are currently very few tests for it in the test suite). Also, the security system isn't used at all to my (mast's) knowledge, and it is not even compiled in by default (it has to be enabled with a configure flag). All this leads to the conclusion that it is easiest to ignore the security system altogether, and if possible leave it as it is with the option to get it working later.

Issue: Contention-free counters

There is probably a need for contention-free counters in several different areas. It should be possible to update them from several threads in parallel without synchronization. Querying the current count is always approximate since it can be changing simultaneously in other threads. However, the thread's own local count is always accurate.

The counters should be separated from the blocks they apply to, to avoid cache line invalidation of those blocks. To accomplish that, a generic tool somewhat similar to block_alloc is created that allocates one or more counter blocks for each thread. In these blocks indexes are allocated, so a counter is defined by the same index into all the thread local counter blocks. Each thread can then modify its own counters without locking, and it typically has its own counter blocks in the local cache while the corresponding main memory is marked invalid.

To query a counter, a thread would need to read the blocks for all other threads. This means that these counters are efficient for updates but less so for queries. However, since queries always are approximate, it is possible to cache them for some time (e.g. 1 ms). Each thread would need its own cache though, since the local count cannot be cached.
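The counter block layout above might look something like this minimal sketch; MAX_THREADS, the static arrays and the function names are all invented for illustration, and a real implementation would allocate blocks per thread and reach them through TLS:

```c
#include <assert.h>

#define MAX_THREADS 8
#define MAX_COUNTERS 16

/* counter_blocks[t][i]: counter index i as seen by thread t. The same
 * index is used into every thread's block. */
static long counter_blocks[MAX_THREADS][MAX_COUNTERS];

/* Unsynchronized update: touches only the calling thread's own block,
 * so there is no cache line ping-pong between threads. */
void counter_add(int thread_id, int counter, long delta)
{
    counter_blocks[thread_id][counter] += delta;
}

/* Query: reads every thread's block, which is the expensive direction,
 * and the result may be stale the moment it is computed. */
long counter_query(int counter)
{
    long sum = 0;
    for (int t = 0; t < MAX_THREADS; t++)
        sum += counter_blocks[t][counter];
    return sum;
}
```

This directly shows the asymmetry the text describes: updates are one local memory write, queries walk all threads' blocks.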
Allocating and freeing counters should be lock-free, and preferably also starting and stopping threads (c.f. issue "Foreign thread visits"). In both cases the freeing steps represent a race problem - see issue "Hazard pointers". To free counters, the counter index would constitute the hazard pointer.

Issue: Lock-free hash table

A good lock-free hash table implementation is necessary. A promising one is http://blogs.azulsystems.com/cliff/2007/03/a_nonblocking_h.html. It requires a CAS (Compare And Swap) instruction to work, but that shouldn't be a problem. The java implementation (http://sourceforge.net/projects/high-scale-lib) is Public Domain. In the comments there is talk about efforts to make a C version.

It supports (through putIfAbsent) the uniqueness requirement for strings, i.e. if several threads try to add the same string (at different addresses) then all will end up with the same string pointer afterwards.

The java implementation relies on the gc to free up the old hash
tables after resize. The proposed gc (issue "Garbage collector") would solve it for us too, but even without that the problem is still solvable - see issue "Hazard pointers".
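The putIfAbsent property needed for string uniqueness can be sketched with a CAS on the table slot: whichever thread wins, every caller ends up with the same canonical pointer. This uses the GCC `__sync` builtin and a tiny fixed-size probing table as a stand-in for a real lock-free, resizable hash table:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TABLE_SIZE 64

static char *table[TABLE_SIZE];

/* Simple illustrative string hash (djb2-style), not Pike's DO_HASHMEM. */
static unsigned hash_str(const char *s)
{
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Returns the canonical pointer for the string contents. */
char *intern_string(char *s)
{
    unsigned i = hash_str(s) % TABLE_SIZE;
    for (;;) {
        char *cur = table[i];
        if (!cur) {
            /* Empty slot: try to claim it atomically; if another thread
             * won the race, re-examine what it installed. */
            char *prev = __sync_val_compare_and_swap(&table[i], (char *)0, s);
            if (!prev) return s;
            cur = prev;
        }
        if (!strcmp(cur, s)) return cur;   /* already interned */
        i = (i + 1) % TABLE_SIZE;          /* linear probing */
    }
}
```

The key invariant is that a slot transitions from NULL to a string pointer exactly once, so concurrent interners of equal strings always converge on one pointer.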
Issue: Hazard pointers

A problem with most lock-free algorithms is how to know no other thread is accessing a block that is about to be freed. Another is the ABA problem which can occur when a block is freed and immediately allocated again (common for block_alloc). Hazard pointers are a good way to solve these problems without leaving the blocks to the garbage collector (see http://www.research.ibm.com/people/m/michael/ieeetpds-2004.pdf). So a
generic hazard pointer tool might be necessary for blocks not known to the gc.
Note however that a more difficult variant of the ABA problem still can occur when the block cannot be freed after leaving the data structure. (In the canonical example with a lock-free stack - see e.g. "ABA problem" in Wikipedia - consider the case when A is a thing that continues to live on and actually gets pushed back.) The only reliable way to cope with that is probably to use wrappers.

Issue: Thread local storage

Implementation would be considerably simpler if working TLS can be assumed on the C level, through the __thread keyword (or __declspec(thread) in Visual C++). A survey of the support for TLS in common compilers and OS'es is needed to decide whether this is a workable assumption:

o GCC: __thread is supported. Source: Wikipedia. FIXME: Check from which version.

o Visual C++: __declspec(thread) is supported. Source: Wikipedia. FIXME: Check from which version.

o Intel C compiler: Support exists. Source: Wikipedia. FIXME: Check from which version.

o Sun C compiler: Support exists. Source: Wikipedia. FIXME: Check from which version.

o Linux (i386, x86_64, sparc32, sparc64): TLS is supported and works for dynamic libs. C.f. http://people.redhat.com/drepper/tls.pdf. FIXME: Check from which version of glibc and kernel (if relevant).

o Windows (i386, x86_64): TLS is supported but does not always work in dll's loaded using LoadLibrary (which means all dynamic modules in pike). C.f. http://msdn.microsoft.com/en-us/library/2s9wt68x.aspx. According to Wikipedia this is fixed in Vista and Server 2008 (FIXME: verify). In any case, TLS is still usable in the pike core.

o MacOS X: FIXME: Check this.

o Solaris: FIXME: Check this.

o *BSD: FIXME: Check this.

Issue: Platform specific primitives

Some low-level primitives, such as CAS and fences, are necessary to build the various lock-free tools. A third-party library would be useful.
o An effort to make a standardized library is here: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2047.html (C level interface at the end). It apparently lacks an implementation, though.

o The linux kernel is reported to contain a good abstraction lib for these primitives, along with implementations for a large set of architectures (see http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2006.08.21a.pdf).

o Another one is part of a lock-free hash implementation here: http://www.sunrisetel.net/software/devtools/sunrise-data-dictionary.shtml It has an MIT-style open source license (with ad clauses).
It appears that the libraries themselves are very short and simple; the difficult part is rather to specify the semantics carefully. It's probably easiest to make one ourselves with ideas from e.g. the linux kernel paper mentioned above.
Required operations:

CAS(address, old_value, new_value)
  Compare-and-set: Atomically sets *address to new_value iff its current value is old_value. Needed for 32-bit variables, and on 64-bit systems also for 64-bit variables.

ATOMIC_INC(address)
ATOMIC_DEC(address)
  Increments/decrements *address atomically. Can be simulated with CAS. 32-bit version necessary, 64-bit version would be nice.

LFENCE()
  Load fence: All memory reads in the thread before this point are guaranteed to be done (i.e. be globally visible) before any following it.

SFENCE()
  Store fence: All memory writes in the thread before this point are guaranteed to be done before any following it.

MFENCE()
  Memory fence: Both load and store fence at the same time. (On many architectures this is implied by CAS etc, but we shouldn't assume that.)

The following operations are uncertain - it is still not known if they're useful and supported enough to be required, or if it's better to do without them:

CASW(address, old_value_low, old_value_high, new_value_low, new_value_high)
  A compare-and-set that works on a double pointer size area. Supported on more modern x86 and x86_64 processors (c.f. http://en.wikipedia.org/wiki/Compare-and-swap#Extensions).

FIXME: More..

Survey of platform support:

o Windows/Visual Studio: Got "Interlocked Variable Access": http://msdn.microsoft.com/en-us/library/ms684122.aspx

o FIXME: More..
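As the list notes, ATOMIC_INC can be simulated with CAS. A sketch of that simulation, using the GCC `__sync` builtin as the stand-in for the port layer's CAS:

```c
#include <assert.h>
#include <stdint.h>

/* ATOMIC_INC built from CAS: retry until no other thread changed
 * *address between our read and our compare-and-swap. Returns the
 * new value. */
static int32_t atomic_inc(int32_t *address)
{
    int32_t old;
    do {
        old = *address;
    } while (!__sync_bool_compare_and_swap(address, old, old + 1));
    return old + 1;
}
```

The loop is the standard CAS retry pattern; under contention it spins, which is why a native atomic increment is preferable where the platform has one.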
Issue: Preemptive thread suspension
The proposed gc as presented in the research paper needs to suspend and resume other threads. A survey of platform support for preemptive thread suspension:
o POSIX threads: No support. Deprecated and removed from the standard since it can very easily lead to deadlocks. On some systems there might still be a pthread_suspend function.

o Windows: SuspendThread and ResumeThread exist but are only intended for use by debuggers.
It's clear that a nonpreemptive method is required. See issue "Garbage collector" item g for details on that.
Issue: OpenMP

OpenMP (see www.openmp.org) is a system to parallelize code using pragmas that are inserted into the code blocks. It can be used to easily parallelize otherwise serial internal algorithms like searching and all sorts of loops over arrays etc. Thus it addresses a different problem than the high-level parallelizing architecture above, but it might provide significant improvements nevertheless. It's therefore worthwhile to look into how this can be deployed in the Pike sources. If support is widespread enough, it could even be considered to make OpenMP a requirement, so that its builtin tools for atomicity and ordering can be deployed (provided they are useful outside the omp parallelized blocks).

Compiler support (taken from www.openmp.org):

o gcc since 4.3.2.

o Microsoft Visual Studio 2008 or later.

o Sun compiler (starting version unknown).

o Intel compiler since 10.1.

o ..and some more.

FIXME: Survey platform-specific limitations.
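A minimal example of the kind of serial loop OpenMP can parallelize with a single pragma. Without -fopenmp the pragma is ignored and the loop runs serially, so the result is the same either way:

```c
#include <assert.h>

/* Sum of squares 0^2 + 1^2 + ... + (n-1)^2; the reduction clause gives
 * each thread a private partial sum that is combined at the end. */
long sum_squares(int n)
{
    long sum = 0;
    int i;
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += (long)i * i;
    return sum;
}
```

This is the "isolated computational task" style of parallelism mentioned for the Image module: no shared Pike data is touched inside the loop, so it sidesteps the whole locking discussion above.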
Various links

Pragmatic nonblocking synchronization for real-time systems
http://www.usenix.org/publications/library/proceedings/usenix01/full_papers/hohmuth/hohmuth_html/index.html

DCAS is not a silver bullet for nonblocking algorithm design
http://portal.acm.org/citation.cfm?id=1007945
A simple and efficient memory model for weakly-ordered architectures
http://www.open-std.org/Jtc1/sc22/WG21/docs/papers/2007/n2237.pdf