Branch: Tag:


2010-10-05 22:46:21 by Martin Stjernholm <>

Initial version.

1: + Multi-cpu support in Pike + -------------------------    -  + This is a draft spec for how to implement multi-cpu support in Pike. + The intention is that it gets extended along the way as more issues + gets ironed out. Discussions take place in "Pike dev" in LysKOM or + +  + Initial draft created 8 Nov 2008 by Martin Stjernholm. +  +  + Background and goals +  + Pike supports multiple threads, but like many other high-level + languages it only allows one thread at a time to access the data + structures. This means that the utilization of multi-cpu and + multi-core systems remains low, even though there are some modules + that can do isolated computational tasks in parallell (e.g. the Image + module). +  + It is the so-called "interpreter lock" that must be locked to access + any reference variable (i.e. everything except floats and native + integers). This lock is held by default in essentially all C code and + is explicitly unlocked in a region by the THREADS_ALLOW/ + THREADS_DISALLOW macros. On the pike level, the lock is always held - + no pike variable can be accessed and no pike function can be called + otherwise. +  + The purpose of the multi-cpu support is to rectify this. The design + goals are, in order of importance: +  + 1. Pike threads should be able to execute pike code concurrently on +  multiple cpus as long as they only modify thread local pike data +  and read a shared pool of static data (i.e. the pike programs, +  modules and constants). +  + 2. There should be as few internal hot spots as possible (preferably +  none) when pike code is executed concurrently. Care must be taken +  to avoid internal synchronization, or updates of shared data that +  would cause "cache line ping-pong" between cpus. +  + 3. The concurrency should be transparent on the pike level. Pike code +  should still be able to access shared data without locking and +  without risking low-level inconsistencies. (So Thread.Mutex etc +  would still be necessary to achieve higher level synchronization.) +  + 4. There should be tools on the pike level to allow further +  performance tuning, e.g. lock-free queues, concurrent access hash +  tables, and the possibility to lock different regions of shared +  data separately. These tools should be designed so that they are +  easy to slot into existing code with few changes. +  + 5. There should be tools to monitor and debug concurrency. It should +  be possible to make assertions that certain objects aren't shared, +  and that certain access patterns don't cause thread +  synchronization. This is especially important if goal (3) is +  realized, since the pike code by itself won't show what is shared +  and what is thread local. +  + 6. C modules should continue to work without source level +  modification (but likely without allowing any kind of +  concurrency). +  + Note that even if goal (3) is accomplished, this is no miracle cure + that would make all multithreaded pike programs run with optimal + efficiency on multiple cpus. One could expect better concurrency in + old code without adaptions, but it could still be hampered + considerably by e.g. frequent updates to shared data. Concurrency is a + problem that must be taken into account on all levels. +  +  + Other languages +  + Perl: All data is thread local by default. Data can be explicitly + shared, in which case Perl ensures internal consistency. Every shared + variable is apparently locked individually. Referencing a thread local + variable from a shared one causes the thread to die. See perthrtut(1). +  + Python: Afaik it's the same state of affairs as Pike. +  +  + Solution overview +  + The basic approach is to divide all data into thread local and shared: +  + o Thread local data is everything that is accessible to one thread +  only, i.e. there are no references to anything in it from shared +  data or from any other thread. This is typically data that the +  current thread has created itself and only reference from the +  stack. The thread can access its local data without locking. +  + o Shared data is everything that is accessible from more than one +  thread. Access to it is synchronized using a global read/write +  lock, the so-called "global lock". I.e. this lock can either be +  locked for reading by many threads, or be locked by a single thread +  for writing. Locking the global lock for writing is the same as +  locking the interpreter lock in current pikes. (This single lock is +  refined later - see issue "Lock spaces".) +  + o There is also a special case where data can be "disowned", i.e. not +  shared and not local in any thread. This is used in e.g. +  Thread.Queue for the objects that are in transit between threads. +  Disowned data cannot have arbitrary references to it - it must +  always be under the control of some object that in some way ensures +  consistency. (Garbage could be made disowned since it by definition +  no longer is accessible from anywhere, but of course it is always +  better to clean it up instead.) +  + +--------+ +---------------------+ Direct +--------+ + | |<-- refs --| Thread 1 local data |<- - access - -| | + | | +---------------------+ | Thread | + | | | 1 | + | |<- - - - Access through global lock only - - - -| | + | Shared | +--------+ + | | + | data | +---------------------+ Direct +--------+ + | |<-- refs --| Thread 2 local data |<- - access - -| | + | | +---------------------+ | Thread | + | | | 2 | + | |<- - - - Access through global lock only - - - -| | + | | +--------+ + +--------+ ... etc ... +  + The principal use case for this model is that threads can do most of + their work with local data and read access to the shared data, and + comparatively seldom require the global write lock to update the + shared data. Every shared thing does not have its own lock since that + would cause excessive lock overhead. +  + Note that the shared data is typically the same as the data referenced + from the common environment (i.e. the "global data"). +  + Also note that the current object (this) always is shared in pike + modules, so a thread cannot assume free access to it. In other pike + classes it would often be shared too, but it is still important to + utilize the situation when it is thread local. See issue "Function + calls". +  + A thread local thing, and all the things it references directly or + indirectly, automatically becomes shared whenever it gets referenced + from a shared thing. +  + A shared thing never automatically becomes thread local, but there is + a function to explicitly "take" it. It would first have to make sure + there are no references to it from shared or other thread local + things. Thread.Queue has a special case so that if a thread local + thing with no other refs is enqueued, it is disowned by the current + thread, and later becomes thread local in the thread that dequeues it. +  +  + Issue: Lock spaces +  + Having a single global read/write lock for all shared data could + become a bottleneck. Thus there is a need for shared data with locks + separate from the global lock. Things that share a common lock is + called a "lock space", and it is always possible to look up the lock + that governs any given thing (see issue "Memory object structure"). +  + A special global lock space, which corresponds to the shared data + discussed above, is created on startup. All others have to be created + explicitly. +  + The intended use case for lock spaces is a "moderately large" + collection of things: Too large and you get outlocking problems, too + small and the lock overhead (both execution- and memorywise) gets + prohibiting. A typical lock space could be a RAM cache consisting of a + mapping and all its content. +  + Many different varieties of lock space locks can be considered, e.g. a + simple exclusive access mutex lock or a read/write lock, priority + locks, locks that ensure fairness, etc. Therefore different (C-level) + implementations should be allowed. +  + One important characteristic of lock space locks is whether they are + implicit or explicit: +  + Implicit locks are locked internally, without intervention on the pike + level. The lock duration is unspecified; locks are only acquired to + ensure internal consistency. All low level data access functions check + whether the lock space for the accessed thing is locked already. If it + isn't then the lock is acquired automatically. All implicit locks have + a well defined lock order (by pointer comparison), and since they only + are taken to guarantee internal consistency, an access function can + always release a lock to ensure correct order (see also issue "Lock + space locking"). +  + Explicit locks are exposed to the pike level and must be locked in a + similar way to Thread.Mutex. If a low level data access function + encounters an explicit lock that isn't locked, it throws an error. + Thus it is left to the pike programmer to avoid deadlocks, but the + pike core won't cause any by itself. Since the pike core keeps track + which lock governs which thing it ensures that no lock violating + access occurs, which is a valuable aid to ensure correctness. +  + One can also consider a variant with a read/write lock space lock that + is implicit for read but explicit for write, thus combining atomic + pike-level updates with the convenience of implicit locking for read + access. +  + The scope of a lock space lock is (at least) the state inside all the + things it contains, but not the set of things itself, i.e. things + might be added to a lock space without holding a write lock (provided + the memory structure allows it). Removing a thing from a lock space + always requires the write lock since that is necessary to ensure that + a lock actually governs a thing for as long as it is held (regardless + it's for reading or writing). +  + FIXME: Allow removing garbage from a lock space without the write + lock? +  + See also issues "Memory object structure" and "Lock space locking" for + more details. +  +  + Issue: Memory object structure +  + Of concern are the refcounted memory objects known to the gc. They are + called "things", to avoid confusion with "objects" which are the + structs for pike objects. +  + There are three types of things: +  + o First class things with ref counter, lock space pointer, and +  double-linked list pointers (to be able to visit all things in +  memory, regardless of other references). Most pike visible types +  are first class things. The exceptions are ints and floats, which +  are passed by value, and strings and types. +  + o Second class things with ref counter and lock space pointer but no +  double-linked list pointers. These are always reached through +  pointers from one or more first class things. It's the job of the +  visit functions for those first class things to ensure that the gc +  visits these, thus they don't need the double-linked list pointers. +  Only strings and types are likely to be of this type. +  + o Third class things contain only a ref counter. They are similar to +  second class except that their lock spaces are implicit from the +  referencing things, which means all those things must always be in +  the same lock space. +  + Thread local things could have NULL as lock space pointer, but as a + debug measure they could also point to the thread object so that it's + possible to detect bugs with a thread accessing things local to + another thread. +  + Before the multi-cpu architecture, all first class things are linked + into the same global double-linked lists (one for each type: array, + mapping, multiset, object, and program). This gets split into one set + of double-linked lists for each thread and for each lock space. That + allows things to be added and removed to a thread or lock space + without requiring other locks (a lock-free double-linked list is + apparently difficult to accomplish). It also allows the gc to do + garbage collection locally in each thread and in each lock space + (although cyclic structures over several lock spaces won't be freed + that way). +  + A global lock-free hash table (see issue "Lock-free hash table") is + used to keep track of all lock space lock objects, and hence all + things they contain in their double-linked lists. +  +  +----------+ +----------+ +  | Thread 1 | | Thread 2 | +  +----------+ +----------+ +  // \\ // \\ // \\ // \\ +  ,--- O O O O ,------------- O O O O ---. +  | \\ // \\ // | \\ // \\ // | +  ref | O O -. | ref O O | ref +  | | | | +  v refs ref | v v +  O <----- O `--> O O O O +  // \\ // \\ // \\ // \\ refs // \\ // \\ +  O O -> O O O O O O <----> O O O O +  \\ // \\ // \\ // \\ // \\ // \\ // +  +--------------+ +--------------+ +--------------+ +  | Lock space 1 | | Lock space 2 | | Lock space 3 | +  +--------------+ +--------------+ +--------------+ +  ^________ ^ ____^ +  | | | + +-----------------------+-|-+-----+-|-+-------+-|-+----------------- + | | X | | X | | X | ... + +-----------------------+---+-----+---+-------+---+----------------- +  + Figure 2: "Space Invaders". The O's represent things, and the \\ and + // represent the double-linked lists. Some examples of references + between things are included, and at the bottom is the global hash + table with pointers to all lock spaces. +  + Accessing a lock space lock structure from the global hash table + requires a hazard pointer (c.f. issue "Hazard pointers"). Accessing it + from a thing is safe if the thread controls at least one ref to the + thing, because a lock space has to be empty to delete the lock space + lock struct. +  +  + Issue: Lock space lock semantics +  + There are three types of locks: +  + o A read-safe lock ensures only that the data is consistent, not that +  it stays constant. This allows lock-free updates in things where +  possible (which could include arrays, mappings, and maybe even +  multisets and objects of selected classes). +  + o A read-constant lock ensures both consistency and constantness +  (i.e. what usually is assumed for a read-only lock). +  + o A write lock ensures complete exclusive access. The owning thread +  can modify the data, and it can assume no other changes occur to it +  (barring refcounters - see below). The owning thread can also under +  limited time leave the data in inconsistent state. This is however +  still limited by the calls to check_threads(), which means that the +  state must be consistent again every time the evaluator callbacks +  are run. See issue "Emulating the interpreter lock". +  + Allowing lock-free updates is attractive, so the standard read/write + lock that governs the global lock space will probably be multiple + read-safe/single write. +  + An exception to the lock semantics above are the reference counters in + refcounted things (c.f. issue "Refcounting and shared data"). A ref to + a thing can always be added or removed if it is certain that the thing + cannot asynchronously disappear. That means: +  + o Refcount changes must always be atomic, even when a write lock is +  held. + o The refcount may be incremented or decremented when any kind of +  read lock is held. + o The refcount may be incremented or decremented without any kind of +  lock at all, provided the same thread already holds at least one +  other ref to the same thing. This means another thread might hold a +  write lock, but it still won't free the thing since the refcount +  never can reach zero. + o A thing may be freed if its refcount is zero and a write lock is +  held. +  + FIXME: Whether or not to free a thing if its refcount is zero and only + some kind of read lock is held is tricky. To allow that it's necessary + to have an atomic-decrement-and-get instruction (can be emulated with + CAS, though) to ensure no other thread is decrementing it and reaching + zero at the same time. Lock-free linked lists are also necessary to + make unlinking possible. Barring that, we need to figure out a policy + for scheduling frees of things reaching refcount zero during read + locks. +  +  + Issue: Lock space locking +  + Assuming that a thread already controls at least one ref to a thing + (so it won't be freed asynchronously), this is the locking process + before accessing it: +  + 1. Read the lock space pointer. If it's NULL then the thing is thread +  local and nothing more needs to be done. + 2. Address an array containing the pointers to the lock spaces that +  are already locked by the thread. + 3. Search for the lock space pointer in the array. If present then +  nothing more needs to be done. + 4. Lock the lock space lock as appropriate. Note that this can imply +  other implicit locks that are held are unlocked to ensure correct +  lock order (see issue "Lock spaces"). Then it's added to the +  array. +  + A thread typically won't hold more than a few locks at any time (less + than ten or so), so a plain array and linear search should perform + well. For quickest possible access the array should be a static thread + local variable (c.f. issue "Thread local storage"). If the array gets + full, implicit locks in it can be released automatically to make + space. Still, a system where more arrays can be allocated and chained + on would perhaps be prudent to avoid the theoretical possibility of + running out of space for locked locks. +  + "Controlling" a ref means either to add one "for the stack", or + ensuring a lock on a thing that holds a ref. Note that implicit locks + might be released in step 4, so unless the thread controls a ref to + the referring thing too, it might no longer exist afterwards, and + hence the thing itself might be gone. +  + Since implicit locks can be released (almost) at will, they are open + for performance tuning: Too long lock durations and they'll outlock + other threads, too short and the locking overhead becomes more + significant. As a starting point, it seems reasonable to release them + at every evaluator callback call (i.e. at approximately every pike + function call and return). +  +  + Issue: Refcounting and shared data +  + Using the traditional refcounting on shared data could easily produce + hotspots: Some strings, shared constants, and the object instances for + pike modules are often accessed from many threads, so their refcounts + would be changed frequently from different processors. +  + E.g. making a single function call in a pike module requires the + refcount of the module object to be increased during the call since + there is a new reference from a pike_frame. The refcounters in the + module objects for commonly used modules like Stdio.pmod/module.pmod + could easily become hotspots. +  + Atomic increments and decrements are not enough to overcome this - the + memory must not be changed at all to avoid slow synchronizations + between cpu local caches. +  + Observation: Refcounters become hotspots primarily in globally + accessible shared data, which for the most part has a long lifetime + (i.e. programs, module objects, and constants). Otoh, they are most + valuable in short-lived data (shared or not), which would produce lots + of garbage if they were to be reaped by the gc instead. +  + Following this observation, the problem with refcounter hotspots can + to a large degree be mitigated by simply turning off refcounting in + the large body of practically static data in the shared runtime + environment. +  + A good way to do that is to extend the resolver in the master to mark + all programs it compiles, their constants, and the module objects, so + that refcounting of them is disabled. To do this, there has to be a + function similar to Pike.count_memory that can walk through a + structure recursively and mark everything in it. When those things + lose their refs, they will always become garbage that only is freed by + the gc. +  + Question: Is there data that is missed with this approach? +  + A disabled refcounter is recognized by a negative value and flagged by + setting the topmost two bits to one and the rest to zero, i.e. a value + in the middle of the negative range. That way, in case there is code + that steps the refcounter then it stays negative. (Such code is still + bad for performance and should be fixed, though.) +  + Disabling refcounting requires the gc to operate differently; see + issue "Garbage collection and external references". +  +  + Issue: Strings +  + Strings are unique in Pike. This property is hard to keep if threads + have local string pools, since a thread local string might become + shared at any moment, and thus would need to be moved. Therefore the + string hash table remains global, and lock congestion is avoided with + some concurrent access hash table implementation. See issue "Lock-free + hash table". +  + Lock-free is a good start, but the hash function must also provide a + good even distribution to avoid hotspots. Pike currently uses an + in-house algorithm (DO_HASHMEM in pike_memory.h). Replacing it with a + more widespread and better studied alternative should be considered. + There seems to be few that are below O(n) (which DO_HASHMEM is), + though. +  +  + Issue: Types +  + Like strings, types are globally unique and always shared in Pike. + That means lock-free access to them is desirable, and it should also + be doable fairly easily since they are constant (except for the + refcounts which can be updated atomically). Otoh it's probably not as + vital as for strings since types typically only are built during + compilation. +  + Types are more or less always part of global shared data. That + suggests they should have their refcounts disabled most of the time + (see issue "Refcounting and shared data"). But again, since types + typically only get built during compilation, their refcounts probably + won't become hotspots anyway. So it looks like they could be exempt + from that rule. +  +  + Issue: Shared mapping and multiset data blocks +  + An interesting issue is if things like mapping/multiset data blocks + should be second or third class things (c.f. issue "Memory object + structure"). If they're third class it means copy-on-write behavior + doesn't work across lock spaces. If they're second class it means + additional overhead handling the lock spaces of the mapping data + blocks, and if a mapping data is shared between lock spaces then it + has to be in some third lock space of its own, or in the global lock + space, neither of which would be very good. +  + So it doesn't look like there's a better way than to botch + copy-on-write in this case. +  +  + Issue: Emulating the interpreter lock +  + For compatibility with old C modules, and for the _disable_threads + function, it is necessary to retain a complete lock like the current + interpretator lock. It has to lock the global area for writing, and + also stop all access to all lock spaces, since the thread local data + might refer to any lock space. +  + This lock is implemented as a read/write lock, which normally is held + permanently for reading by all threads. Only when a thread is waiting + to acquire the compat interpreter lock is it released as each thread + goes into check_threads(). +  + This lock cannot wait for explicit lock space locks to be released. + Thus it can override the assumption that a lock space is safe from + tampering by holding a write lock on it. Still, it's only available + from the C level (with the exception of _disable_threads) so the + situation is not any different from the way the interpreter lock + overrides Thread.Mutex today. +  +  + Issue: Function calls +  + A lock on an object is almost always necessary before calling a + function in it. Therefore the central apply function (mega_apply) must + ensure an appropriate lock is taken. Which kind of lock + (read-safe/read-constant/write - see issue "Lock space lock + semantics") depends on what the function wants to do. Therefore all + object functions are extended with flags for this. +  + The best default is probably read-safe. Flags for no locking (for the + few special cases where the implementations actually are completely + lock-free) and for compat-interpreter-lock-locking should probably + exist as well. A compat-interpreter-lock flag is also necessary for + global functions that don't have a "this" object (aka efuns). +  + Having the required locking declared this way also alleviates each + function from the burden of doing the locking to access the current + storage, and it allows future compiler optimizations to minimize lock + operations. +  +  + Issue: Exceptions +  + "Forgotten" locks after exceptions shouldn't be a problem: Explicit + locks are handled just like today (i.e. it's up to the pike + programmer), and implicit locks can safely be released when an + exception is thrown. +  + One case requires attention: An old-style function that requires the + compat interpreter lock might catch an error. In that case the error + system has to ensure that lock is reacquired. +  +  + Issue: C module interface +  + A new add_function variant is probably added for new-style functions. + It takes bits for the flags discussed for issue "Function calls". + New-style functions can only assume free access to the current storage + according to those flags; everything else must be locked (through a + new set of macros/functions). +  + Accessor functions for data types (e.g. add_shared_strings, + mapping_lookup, and object_index_no_free) handles the necessary + locking internally. They will only assume that the thing is safe, i.e. + that the caller ensures the current thread controls at least one ref. +  + THREADS_ALLOW/THREADS_DISALLOW and their likes are not used in + new-style functions. +  + There will be new GC callbacks for walking module global pointers to + things (see issue "Garbage collection and external references"). +  +  + Issue: C module compatibility +  + Ref issue "Emulating the interpreter lock". +  + Ref issue "Garbage collection and external references". +  +  + Issue: Garbage collection and external references +  + The current gc design is that there is an initial "check" pass that + determines external references by counting all internal references, + and then for each thing subtract it from its refcount. If the result + isn't zero then there are external references (e.g. from global C + variables or from the C stack) and the thing is not garbage. +  + Since refcounting can be disabled in some objects (see issue + "Refcounting and shared data"), this approach no longer work; the gc + has to be changed to find external references some other way: +  + References from global C variables are few, so they can be dealt with + by requiring C modules and the core parts to provide callbacks that + lets the gc walk through them (see issue "C module interface"). This + is however not compatible with old C modules. +  + References from C stacks are common, and it is infeasible to require + callbacks that keep track of them. The gc instead has to scan the C + stacks for the threads and treat any aligned machine word containing + an apparently valid pointer to a known thing as an external reference. + This is the common approach used by standalone gc libraries that don't + require application support. For reference, here is one such garbage + collector, written in C++: + + Its source is here: + +  + The same approach is also necessary to cope with old C modules (see + issue "C module compatibility"), but since global C level pointers are + few, it might not be mandatory to get this working. +  + Btw, using this approach to find external refs should be considerably + more efficient than the old "check" pass, even if C stacks are scanned + wholesale. +  +  + Issue: Local garbage collection +  + Each thread periodically invokes a gc that only looks for garbage in + the local data of that thread. This can naturally be done without + disturbing the other threads. It follows that this gc also can be + disabled on a per-thread basis. This is a reason for keeping thread + local data in separate double-linked lists (see issue "Memory object + structure"). +  + Similarly, if gc statistics are added to each lock space, they could + also be gc'd for internal garbage at appropriate times when they get + write locked by some thread. That might be interesting since known + cyclic structures could then be put in lock spaces of their own and be + gc'd efficiently without a global gc. Note that a global gc is still + required to clean up cycles with things in more than one lock space. +  +  + Issue: Global pike level caches +  +  + Issue: Thread.Queue +  + A lock-free implementation should be used. The things in the queue are + typically disowned to allow them to become thread local in the reading + thread. +  +  + Issue: False sharing +  + False sharing occurs when thread local things used frequently by + different threads are next to each other so that they share the same + cache line. Thus the cpu caches might force frequent resynchronization + of the cache line even though there is no apparent hotspot problem on + the C level. +  + This can be a problem in particular for all the block_alloc pools + containing small structs. Using thread local pools is seldom a + workable solution since most thread local structs might become shared + later on. +  + One way to avoid it is to add padding (and alignment). Cache line + sizes are usually 64 bytes or less (at least for Intel ia32). That + should be small enough to make this viable in many cases. +  + FIXME: Check cache line sizes on the other important architectures. +  + Worth noting that the problem is greatest for the frequently changed + ref counters at the start of each thing, so the most important thing + is to keep ref counters separated. I.e. things larger than a cache + line can probably be packed without padding. +  + Another way is to move things when they get shared, but that is pretty + complicated and slow. +  +  + Issue: Malloc and block_alloc +  + Standard OS mallocs are usually locking. Bundling a lock-free one + could be important. FIXME: Survey free implementations. +  + Block_alloc is a simple homebrew memory manager used in several + different places to allocate fixed-size blocks. The block_alloc pools + are often shared, so they must allow efficient concurrent access. With + a modern malloc, it is possible that the need for block_alloc is gone, + or perhaps the malloc lib has builtin support for fixed-size pools. + Making a lock-free implementation is nontrivial, so the homebrew ought + to be ditched in any case. +  + A problem with ditching block_alloc is that there is some code that + walks through all allocated blocks in a pool, and also avoids garbage + by freeing the whole pool altogether. FIXME: Investigate alternatives + here. +  + See also issue "False sharing". +  +  + Issue: The compiler +  +  + Issue: Foreign thread visits +  + JVM threads.. +  +  + Issue: Pike security system +  + It is possible that keeping the pike security system intact would + complicate the implementation, and even if it was kept intact a lot of + testing would be required before one can be confident that it really + works (and there are currently very few tests for it in the test + suite). +  + Also, the security system isn't used at all to my (mast's) knowledge, + and it is not even compiled in by default (has to be enabled with a + configure flag). +  + All this leads to the conclusion that it is easiest to ignore the + security system altogether, and if possible leave it as it is with the + option to get it working later. +  +  + Issue: Contention-free counters +  + There is probably a need for contention-free counters in several + different areas. They should be possible to update from several + threads in parallell without synchronization. Querying the current + count is always approximate since it can be changing simultaneously in + other threads. However, the thread's own local count is always + accurate. +  + They should be separated from the blocks they apply to, to avoid cache + line invalidation of those blocks. +  + To accomplish that, a generic tool somewhat similar to block_alloc is + created that allocates one or more counter blocks for each thread. In + these blocks indexes are allocated, so a counter is defined by the + same index into all the thread local counter blocks. +  + Each thread can then modify its own counters without locking, and it + typically has its own counter blocks in the local cache while the + corresponding main memory is marked invalid. To query a counter, a + thread would need to read the blocks for all other threads. +  + This means that these counters are efficient for updates but less so + for queries. However, since queries always are approximate, it is + possible to cache them for some time (e.g. 1 ms). Each thread would + need its own cache though, since the local count cannot be cached. +  + It should be lock-free for allocating and freeing counters, and + preferably also for starting and stopping threads (c.f. issue "Foreign + thread visits"). In both cases the freeing steps represents a race + problem - see issue "Hazard pointers". To free counters, the counter + index would constitute the hazard pointer. +  +  + Issue: Lock-free hash table +  + A good lock-free hash table implementation is necessary. A promising + one is + It requires a CAS (Compare And Swap) instruction to work, but that + shouldn't be a problem. The java implementation + ( is Public Domain. In + the comments there is talk about efforts to make a C version. +  + It supports (through putIfAbsent) the uniqueness requirement for + strings, i.e. if several threads try to add the same string (at + different addresses) then all will end up with the same string pointer + afterwards. +  + The java implementation relies on the gc to free up the old hash + tables after resize. We don't have that convenience, but the problem + is still solvable; see issue "Hazard pointers". +  +  + Issue: Hazard pointers +  + A problem with most lock-free algorithms is how to know no other + thread is accessing a block that is about to be freed. Another is the + ABA problem which can occur when a block is freed and immediately + allocated again (common for block_alloc). +  + Hazard pointers are a good way to solve these problems without leaving + the blocks to the garbage collector (see + So a + generic hazard pointer tool is necessary. +  + Note however that a more difficult variant of the ABA problem still + can occur when the block cannot be freed after leaving the data + structure. (In the canonical example with a lock-free stack - see e.g. + "ABA problem" in Wikipedia - consider the case when A is a thing that + continues to live on and actually gets pushed back.) The only reliable + way to cope with that is probably to use wrappers. +  +  + Issue: Thread local storage +  + Implementation would be considerably simpler if working TLS can be + assumed on the C level, through the __thread keyword (or + __declspec(thread) in Visual C++). A survey of the support for TLS in + common compilers and OS'es is needed to decide whether this is an + workable assumption: +  + o GCC: __thread is supported. Source: Wikipedia. +  FIXME: Check from which version. +  + o Visual C++: __declspec(thread) is supported. Source: Wikipedia. +  FIXME: Check from which version. +  + o Intel C compiler: Support exists. Source: Wikipedia. +  FIXME: Check from which version. +  + o Sun C compiler: Support exists. Source: Wikipedia. +  FIXME: Check from which version. +  + o Linux (i386, x86_64, sparc32, sparc64): TLS is supported and works +  for dynamic libs. C.f. +  FIXME: Check from which version of glibc and kernel (if relevant). +  + o Windows (i386, x86_64): TLS is supported but does not always work +  in dll's loaded using LoadLibrary (which means all dynamic modules +  in pike). C.f. +  According to Wikipedia this is fixed in Vista and Server 2008 +  (FIXME: verify). In any case, TLS is still usable in the pike core. +  + o MacOS X: FIXME: Check this. +  + o Solaris: FIXME: Check this. +  + o *BSD: FIXME: Check this. +  +  + Issue: Platform specific primitives +  + Some low-level primitives, such as CAS and fences, are necessary to + build the various lock-free tools. A third-party library would be + useful. +  + An effort to make a standardized library is here: + (C + level interface at the end). It apparently lacks implementation, + though. +  + Required operations: +  + CAS(address, old_value, new_value) +  Compare-and-set: Atomically sets *address to new_value iff its +  current value is old_value. Needed for 32-bit variables, and on +  64-bit systems also for 64-bit variables. +  + ATOMIC_INC(address) + ATOMIC_DEC(address) +  Increments/decrements *address atomically. Can be simulated with +  CAS. 32-bit version necessary, 64-bit version would be nice. +  + LFENCE() +  Load fence: All memory reads in the thread before this point are +  guaranteed to be done (i.e. be globally visible) before any +  following it. +  + SFENCE() +  Store fence: All memory writes in the thread before this point are +  guaranteed to be done before any following it. +  + MFENCE() +  Memory fence: Both load and store fence at the same time. (On many +  architectures this is implied by CAS etc, but we shouldn't assume +  that.) +  + The following operations are uncertain - still not known if they're + useful and supported enough to be required, or if it's better to do + without them: +  + CASW(address, old_value_low, old_value_high, new_value_low, new_value_high) +  A compare-and-set that works on a double pointer size area. +  Supported on more modern x86 and x86_64 processors (c.f. + +  + FIXME: More.. +  + Survey of platform support: +  + o Windows/Visual Studio: Got "Interlocked Variable Access": + +  + o FIXME: More.. +  +  + Various links +  + Pragmatic nonblocking synchronization for real-time systems + + DCAS is not a silver bullet for nonblocking algorithm design +   Newline at end of file added.