pike.git / multi-cpu.txt


pike.git/multi-cpu.txt:202:   things it contains, but not the set of
  things itself, i.e. things might be added to a lock space without
  holding a write lock. Removing a thing from a lock space always
  requires the write lock since that is necessary to ensure that a lock
  actually governs a thing for as long as it is held (regardless of
  whether it's for reading or writing).

  See also issues "Memory object structure" and "Lock space locking" for
  more details.


- Issue: Garbage collector
-
- Pike has used refcounting to collect noncyclic structures, combined
- with a stop-the-world periodical collector for cyclic structures. The
- periodic pauses are already a problem, and it only gets worse as the
- heap size and number of concurrent threads increase. Since the gc
- needs an overhaul anyway, it makes sense to replace it with a more
- modern solution.
-
- PHD-200-10.ps [FIXME: ref] is a recent thesis work that combines
- several state-of-the-art gc algorithms to an efficient whole: It
- describes a generational collector that uses deferred-update
- refcounting for old things with on-the-fly collection, and on-the-fly
- mark-and-sweep for young things. An on-the-fly cycle detector is also
- employed for the refcounted area. See said work for rationale and
- details.
-
- Effects of using this in Pike:
-
- a. References from the C or pike stacks don't need any handling at
-    all (see also issue "Garbage collection and external references").
-
- b. Special code is used to update refs in the heap. During certain
-    circumstances, before changing a pointer inside a thing which can
-    point to another thing, the state of all non-NULL pointers in it
-    are copied to a thread local log.
-
- c. A new LogPointer field is required per thing. If a state copy has
-    taken place as described above, it points to the log that contains
-    the original pointer state of the thing.
-
-    Data containers that can be of arbitrary size (i.e. arrays,
-    mappings and multisets) should be segmented into fixed-sized
-    chunks with one LogPointer each, so that the state copy doesn't
-    get arbitrarily large.
-
- d. The double-linked lists aren't needed. Hence two pointers less per
-    thing.
-
- e. The refcounter word is changed to hold both normal refcount, weak
-    count(?), and flags. Overflowed counts are stored in a separate
-    hash table.
-
- f. The collector typically runs concurrently with the rest of the
-    program. It sometimes interrupts the other threads for handshakes.
-    These interrupts are not aligned with the evaluator callback
-    calls, since that would cause too much pausing of the collector
-    thread. This requires that threads can be stopped and resumed
-    externally. FIXME: Verify this in pthreads and on windows.
-
- g. All garbage collection, both for noncyclic and cyclic garbage, are
-    discovered and handled by the gc thread. The other threads never
-    frees any block known to the gc.
-
- g. An effect of the above is that all garbage is discovered by a
-    separate collector thread which doesn't execute any other pike
-    code. This opens up the issue on how to call destruct functions.
-
-    At least thread local things should reasonably get their destruct
-    calls in that thread. A problem is however what to do when that
-    thread has exited or emigrated (see issue "Foreign thread
-    visits").
-
-    For shared things it's not clear which thread should call destruct
-    anyway, so in that case any thread could do it. It might however
-    be a good idea to not do it directly in the gc thread, since doing
-    so would require that thread too to be a proper pike thread with
-    pike stack etc; it seems better to keep it an "invisible"
-    low-level thread outside the "worker" threads. In programs with a
-    "backend thread" it could be useful to allow the gc thread wake up
-    the backend thread to let it execute the destruct calls.
-
- h. The most bothersome problem is that things are no longer freed
-    right away when running out of refs. This behavior in Pike is used
-    implicitly in many places, mainly to release locks timely by just
-    putting them in a local variable that gets freed when the function
-    exits (either by normal return or by exception).
-
-    Maybe a solution can be devised to keep this characteristic in
-    that special case, i.e. when a thread local thing only got a
-    single reference from the stack. This should be easy to detect in
-    the compiler: It's an assignment to a local variable that never
-    gets referenced. That's already a warning, but it is currently
-    tuned down to not warn in these cases (precisely to allow this
-    problematic idiom).
-
-    So the compiler could in such cases add implicit destruct calls on
-    function exit. Consider however if someone adds e.g. a werror call
-    to print out a description of the MutexKey object. The local
-    variable is referenced, but one won't expect that the innocent
-    werror() would change the freeing of the MutexKey. It's therefore
-    probably better to strengthen the compiler warning and require
-    people to deal with it on the Pike level (a werror for debug
-    purposes is unlikely to be there permanently, at least).
-
-    Question: Are there more cases where pike programmers expect
-    immediate frees?
-
- i. FIXME: How to solve weak refs?
-
- j. One might consider separating the refcounts from the things by
-    using a hash table. This makes sense when considering that only
-    the collector thread is using the refcounts, thereby avoiding
-    false aliasing occurring from refcounter updates (and other gc
-    related flags) by that thread.
-
-    All the hash table lookups would however incur a significant
-    overhead in the gc thread. A better alternative would be to use a
-    bitmap based on the possible allocation slots used by the malloc
-    implementation, but that would require very tight integration with
-    the malloc system. The bitmap could work with only two bits per
-    refcounter - research shows that most objects in a refcounted heap
-    have very few refs. Overflowing (a.k.a. "stuck") refcounters at 3
-    would then be stored in a hash table.
-
- k. FIXME: Is the third NOP handshake really necessary?
-
- To simplify memory handling, the gc should be used consistently on all
- heap structs, regardless whether they are pike visible things or not.
- An interesting question is whether the type info for every struct
- (more concretely, the address of some area where the gc can find the
- functions it needs to handle the struct) is carried in the struct
- itself (through a new pointer field), or if it continues to be carried
- in the context for every pointer to the struct (e.g. in the type field
- in svalues).
-
-
+
  Issue: Memory object structure

  Of concern are the memory objects known to the gc. They are called
  "things", to avoid confusion with "objects" which are the structs for
  pike objects.

  There are two types of things:

  o First class things with gc header and lock space pointer. Most pike
    visible types are first class things. The exceptions are ints and

pike.git/multi-cpu.txt:454:   running out of space for locked locks.

  Since implicit locks can be released (almost) at will, they are open
  for performance tuning: Too long lock durations and they'll lock out
  other threads, too short and the locking overhead becomes more
  significant. As a starting point, it seems reasonable to release them
  at every evaluator callback call (i.e. at approximately every pike
  function call and return).


+ Issue: Garbage collector
+
+ Pike has used refcounting to collect noncyclic structures, combined
+ with a stop-the-world periodical collector for cyclic structures. The
+ periodic pauses are already a problem, and it only gets worse as the
+ heap size and number of concurrent threads increase. Since the gc
+ needs an overhaul anyway, it makes sense to replace it with a more
+ modern solution.
+
+ http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-get.cgi/2006/PHD/PHD-2006-10.ps
+ is a recent thesis work that combines several state-of-the-art gc
+ algorithms into an efficient whole. A brief overview of the
+ highlights:
+
+ o The reference counts aren't updated for references on the stack.
+   The stacks are scanned when the gc runs instead. This saves a great
+   deal of refcount updating, and it also simplifies C level
+   programming a lot. Only refcounts between things on the heap are
+   counted.
+
+ o The refcounts are only updated when the gc runs. This saves a lot
+   of the remaining updates: if a pointer starts with value p_0, then
+   changes to p_1, p_2, p_3, ..., and lastly to p_n at the next gc,
+   then only p_0->refs needs to be decremented and p_n->refs needs to
+   be incremented - the changes in all the other refcounts, for the
+   things pointed to in between, cancel out.
+
+ o The above is accomplished by thread local logging, to make the old
+   p_0 value available to the gc at the next run. This means it scales
+   well with many cpu's.
+
+ o A generational gc uses refcounting only for old things in the heap.
+   New things, which are typically very short-lived, aren't refcounted
+   at all but instead gc'ed using a mark-and-sweep collector. This is
+   shown to be more efficient for short-lived data, and it handles
+   cyclic structures without any extra effort.
+
+ o By using refcounting on old data, the gc only needs to give
+   attention to refcounts that get down to zero. This means the heap
+   can scale to any size without affecting the gc run time, as opposed
+   to using a mark-and-sweep collector on the whole heap. Thus the gc
+   time scales only with the amount of _change_ in the heap.
+
+ o Cyclic structures in the old refcounted data are handled
+   incrementally, using the fact that a cyclic structure can only
+   occur when a refcounter is decremented to a value greater than
+   zero. Those things can therefore be tracked and cycle checked in
+   the background. The gc uses several different methods to weed out
+   false alarms before doing actual cycle checks.
+
+ o The gc runs entirely in its own thread. It only needs to stop the
+   working threads for a very short time to scan stacks etc, and they
+   can be stopped one at a time.
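+
+ As a concrete illustration of the thread local logging, here is a
+ minimal sketch in C. It is not taken from the thesis: the names
+ (struct log_entry, set_ptr) and the fixed pointer-slot layout are
+ assumptions made for the example only, and the real wrapper would
+ likely be a macro taking a field name (c.f. item c below and issue
+ "C module interface"):
+
+   #include <stdlib.h>
+
+   #define MAX_PTRS 4
+
+   struct thing;
+
+   /* Log record: the pointer state of one thing, saved once per gc
+    * cycle before its first mutation. */
+   struct log_entry {
+     struct thing *thing;
+     struct thing *old_ptrs[MAX_PTRS];
+     struct log_entry *next;
+   };
+
+   struct thing {
+     struct log_entry *log_pointer; /* NULL until first change in
+                                     * the current gc cycle */
+     struct thing *ptrs[MAX_PTRS];  /* pointers to other things */
+   };
+
+   /* One log per thread, so appending needs no locking. */
+   static _Thread_local struct log_entry *thread_log;
+
+   /* Write barrier used instead of a plain t->ptrs[i] = val. */
+   static void set_ptr(struct thing *t, int i, struct thing *val)
+   {
+     if (!t->log_pointer) {
+       /* First change of t this cycle: save its old pointer state. */
+       struct log_entry *e = malloc(sizeof *e);
+       e->thing = t;
+       for (int j = 0; j < MAX_PTRS; j++) e->old_ptrs[j] = t->ptrs[j];
+       e->next = thread_log;
+       thread_log = e;
+       t->log_pointer = e;
+     }
+     t->ptrs[i] = val;  /* no refcount update here - the gc derives
+                         * the net refcount changes from the logged
+                         * p_0 state at the next run */
+   }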
+
+ Effects of using this in Pike:
+
+ a. References from the C or pike stacks don't need any handling at
+    all (see also issue "Garbage collection and external references").
+
+ b. A significant complication in various lock-free algorithms is the
+    safe freeing of old blocks (see e.g. issue "Lock-free hash
+    table"). This gc would solve almost all such problems in a
+    convenient way.
+
+ c. Special code is used to update refs in the heap. During certain
+    circumstances, before changing a pointer inside a thing which can
+    point to another thing, the state of all non-NULL pointers in it
+    is copied to a thread local log.
+
+    This is mostly problematic since it requires that every pointer
+    assignment inside a thing is replaced with a macro or function
+    call, which has a big impact on C code. See issue "C module
+    interface".
+
+ d. A new log_pointer field is required per thing. If a state copy has
+    taken place as described above, it points to the log that contains
+    the original pointer state of the thing.
+
+    Data containers that can be of arbitrary size (i.e. arrays,
+    mappings and multisets) should be segmented into fixed-sized
+    chunks with one log_pointer each, so that the state copy doesn't
+    get arbitrarily large.
+
+ e. The double-linked lists aren't needed. Hence two pointers less per
+    thing.
+
+ f. The refcounter word is changed to hold both normal refcount, weak
+    count, and flags. Overflowed counts are stored in a separate hash
+    table.
+
+ g. The collector typically runs concurrently with the rest of the
+    program. It sometimes interrupts the other threads for handshakes.
+    These interrupts are not aligned with the evaluator callback
+    calls, since that would cause too much pausing of the collector
+    thread. This requires that threads can be stopped and resumed
+    preemptively. See issue "Preemptive thread suspension".
+
+ h. All garbage, both noncyclic and cyclic, is discovered and handled
+    by the gc thread. The other threads never free any block known to
+    the gc.
+
+ i. An effect of the above is that all garbage is discovered by a
+    separate collector thread which doesn't execute any other pike
+    code. This opens up the issue of how to call destruct functions.
+
+    At least thread local things should reasonably get their destruct
+    calls in that thread. A problem is however what to do when that
+    thread has exited or emigrated (see issue "Foreign thread
+    visits").
+
+    For shared things it's not clear which thread should call destruct
+    anyway, so in that case any thread could do it. It might however
+    be a good idea to not do it directly in the gc thread, since doing
+    so would require that thread too to be a proper pike thread with
+    pike stack etc; it seems better to keep it an "invisible"
+    low-level thread outside the "worker" threads. In programs with a
+    "backend thread" it could be useful to let the gc thread wake up
+    the backend thread to have it execute the destruct calls.
+
+ j. The most bothersome problem is that things are no longer freed
+    right away when running out of refs. See issue "Immediate
+    destruct/free when refcount reaches zero".
+
+ k. FIXME: How to solve weak refs?
+
+ l. One might consider separating the refcounts from the things by
+    using a hash table. This makes sense when considering that only
+    the collector thread is using the refcounts, thereby avoiding
+    false aliasing occurring from refcounter updates (and other gc
+    related flags) by that thread.
+
+    All the hash table lookups would however incur a significant
+    overhead in the gc thread. A better alternative would be to use a
+    bitmap based on the possible allocation slots used by the malloc
+    implementation, but that would require very tight integration with
+    the malloc system.
+    The bitmap could work with only two bits per refcounter - research
+    shows that most objects in a refcounted heap have very few refs.
+    Refcounters that overflow (a.k.a. get "stuck") at 3 would then be
+    stored in a hash table (see the sketch after this list).
+
+ m. FIXME: Is the third NOP handshake really necessary?
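+
+ The two-bit bitmap scheme in item l could look something like the
+ following sketch. The slot_of and stuck_table_add interfaces to the
+ malloc layer are assumptions made for the example, and since only
+ the gc thread touches these structures, no atomics are needed:
+
+   #include <stdint.h>
+   #include <stddef.h>
+
+   #define NUM_SLOTS (1u << 20)   /* allocation slots in the heap */
+   #define STUCK 3                /* two-bit counters saturate here */
+
+   static uint8_t rc_bits[NUM_SLOTS / 4];    /* 2 bits per slot */
+
+   extern size_t slot_of(void *thing);       /* from the malloc layer */
+   extern void stuck_table_add(void *thing, int delta);
+
+   static unsigned rc_get(void *t)
+   {
+     size_t s = slot_of(t);
+     return (rc_bits[s >> 2] >> ((s & 3) * 2)) & 3;
+   }
+
+   static void rc_set(void *t, unsigned v)
+   {
+     size_t s = slot_of(t);
+     unsigned shift = (s & 3) * 2;
+     rc_bits[s >> 2] =
+       (uint8_t) ((rc_bits[s >> 2] & ~(3u << shift)) | (v << shift));
+   }
+
+   void rc_inc(void *t)
+   {
+     unsigned v = rc_get(t);
+     if (v == STUCK)
+       stuck_table_add(t, +1);    /* overflow goes to the hash table */
+     else
+       rc_set(t, v + 1);
+   }
+
+   void rc_dec(void *t)
+   {
+     unsigned v = rc_get(t);
+     if (v == STUCK) {
+       stuck_table_add(t, -1);    /* may later unstick via the table */
+       return;
+     }
+     rc_set(t, v - 1);
+     /* v - 1 == 0 means no refs remain: queue for freeing. A
+      * decrement to a value > 0 is what can create a garbage cycle,
+      * so those things would instead be queued for the background
+      * cycle check described above. */
+   }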
+
+ To simplify memory handling, the gc should be used consistently on all
+ heap structs, regardless of whether they are pike visible things or
+ not. An interesting question is whether the type info for every struct
+ (more concretely, the address of some area where the gc can find the
+ functions it needs to handle the struct) is carried in the struct
+ itself (through a new pointer field), or if it continues to be carried
+ in the context for every pointer to the struct (e.g. in the type field
+ in svalues).
+
+ Since the gc would be used for most internal structs as well, which
+ are almost exclusively used via compile-time typed pointers, it would
+ probably save significant heap space to retain the type in the pointer
+ context. It does otoh complicate the gc - everywhere the gc is fed a
+ pointer to a thing, it must also be fed a type info pointer, and the
+ gc must then keep track of this data tuple internally.
+
+
+ Issue: Immediate destruct/free when refcount reaches zero
+
+ When a thing in Pike runs out of references, it's destructed and freed
+ almost immediately in the pre-multi-cpu implementation. This behavior
+ is used implicitly in many places. The major (hopefully all) principal
+ use cases of concern are:
+
+ 1. It's popular to make code that releases a lock promptly by just
+    storing it in a local variable that gets freed when the function
+    exits (either by normal return or by exception). E.g:
+
+      void foo() {
+        Thread.MutexKey my_lock = my_mutex->lock();
+        ... do some work ...
+        // my_lock falls out of scope here when the function exits
+        // (also if it's due to a thrown exception), so the lock is
+        // released right away.
+      }
+
+    There's also code that opens files and sockets etc, and expects
+    them to be automatically closed again through this method. (That
+    practice has been shown to be bug prone, though, so in the sources
+    at Roxen many of those places have been fixed over time.)
+
+ 2. In some cases, structures are carefully kept acyclic to make them
+    get freed quickly, and there is no control over which party gets
+    the "last reference".
+
+    One example is if a cache holds one ref to an entry, and there
+    might at the same time be one or more worker threads that hold
+    references to the same entry while they use it. In this case the
+    cache can be pruned safely by dropping the reference to the entry,
+    without destructing it.
+
+    A variant when the structure cannot be made acyclic is to make a
+    "wrapper object": It holds a reference to the cyclic structure,
+    and all other parties make sure to hold a ref to the wrapper as
+    long as they have an interest in any part of the data. When the
+    wrapper runs out of refs, it destructs the cyclic structure
+    explicitly.
+
+    These tricks have mostly been used to reduce the amount of cyclic
+    garbage that would otherwise require the stop-the-world gc to run
+    more often, but there are also occasions when the structure holds
+    open fd's which must be closed without delay (one such occasion is
+    the connection fd in the http protocol in the Roxen WebServer).
+
+ 3. In some applications with extremely high data mutation rate, the
+    immediate freeing of acyclic structures is seen as a prerequisite
+    to keep bounds on memory consumption.
+
+ 4. FIXME: Are there more?
+
+ The proposed gc (c.f. issue "Garbage collector") does not retain the
+ immediate destruct and free semantic - only the gc running in its own
+ thread may free things. Although it would run much more often than the
+ old gc (probably on the order of once a minute up to several times a
+ second), it would still break this semantic. To discuss each use case
+ above:
+
+ 1. Locks, and in some cases also open fd's, cannot wait until the
+    next gc run.
+
+    Observing that mutex locks are always thread local things, almost
+    all these cases (exceptions are possibly fd objects that somehow
+    are shared anyway) can be solved by a modified gc approach - see
+    issue "Micro-gc".
+
+    Since the micro-gc approach appears to be expensive, it's worth
+    considering actually ditching this behavior and solving the
+    problem on the pike level instead. The compiler can be used to
+    detect many of these cases by looking for assignments to local
+    variables that aren't accessed from anywhere (there is already
+    such a warning, but it has been tuned down just to allow this
+    problematic idiom).
+
+    A new language construct would be necessary, to ensure that the
+    variable gets destructed both on normal function exit and when an
+    exception is thrown. It could look something like this:
+
+      void foo() {
+        destruct_on_exit (Thread.MutexKey my_lock = my_mutex->lock()) {
+          ... do some work which requires the lock ...
+        }
+      }
+
+    I.e. the destruct_on_exit clause ensures that the variable(s) in
+    the parentheses are destructed (regardless of the amount of refs)
+    if execution passes out of the block in any way.
+
+    Anyway, since implementing the micro-gc is a comparatively small
+    amount of extra work, the intention is to do that first, and then
+    later implement the full gc as an experimental mode so that
+    performance can be compared.
+
+ 2. This is not a problem as long as the reason is only gc efficiency.
+    It's worth noting that tricks such as "wrapper objects" still have
+    some use since they lessen the load on the background cycle
+    detector.
+
+    It is however a problem if there are open fd's or similar things
+    in the structure. It doesn't look like this is feasible to solve
+    internally; such structures typically are shared data, and letting
+    different threads reference shared data without locking is
+    essential for multi-cpu performance. This is therefore a case that
+    is probably best solved on the pike level instead, possibly
+    through pike-visible refcounting. These cases appear to be fairly
+    few, at least.
+
+ 3. If the solution in the issue "Micro-gc" is implemented, this
+    problem hardly exists at all since thread local data is refcounted
+    and freed almost exactly the same way as before.
+
+    Otherwise, since the gc thread operates only on the new and
+    changed data, and collects newly allocated data very efficiently,
+    it would keep up with a very high mutation rate. GC runs are
+    scheduled to run just often enough to keep the heap size within a
+    set limit - as long as the gc thread doesn't become saturated and
+    runs continuously, it offloads the refcounting and freeing
+    overhead from the worker threads completely.
+
+    If the data mutation rate is so high that the gc thread becomes
+    saturated, what would happen is that malloc calls would start to
+    block when the heap limit is reached. Research shows that a
+    periodic gc done right provides considerably more throughput than
+    pure refcounting, so the application would still run faster
+    including that blocking.
+
+    The remaining concern is then that the blocking would introduce
+    uneven response times - the worker threads would go very fast most
+    of the time but every once in a while they could hang waiting on
+    the gc thread. These hangs are (according to the research paper)
+    on the order of milliseconds, but if they still are problematic
+    then a crude solution would be to introduce artificial short
+    sleeps in the working threads to bring down the mutation rate -
+    even with those sleeps the application would probably still be
+    significantly faster than the current approach.
+
+
+ Issue: Micro-gc
+
+ A way to retain the immediate-destruct (and free) semantic for thread
+ local things referenced only from the pike stack is to implement a
+ "micro-gc" that runs very quickly and is called often enough to keep
+ the semantic.
+
+ To begin with, the mark-and-sweep gc for new data (as discussed in the
+ issue "Garbage collector") is not implemented, and the refcounts for
+ thread local things are not delay-updated at all. The work of the
+ micro-gc then becomes to free all things in the zero-count table (ZCT)
+ that aren't referenced from the thread's C and pike stacks.
+
+ Scanning the two stacks completely in every micro-gc would be too
+ expensive. That is solved by partitioning the ZCT so that every pike
+ stack frame gets one of its own. New zero-count things are always put
+ in the ZCT for the current topmost frame.
+
+ That way, the micro-gc can scan the topmost parts of the stacks (above
+ the last pike stack frame) for references to things in the topmost
+ ZCT, and when a pike stack frame is popped then the things in its ZCT
+ can be freed without scanning at all (see the sketch below). This is
+ enough to destruct and free the things put on the pike stack in a
+ timely fashion.
+
+ Furthermore, since the old immediate-destruct semantics only requires
+ destructing before and after every pike level function call, it won't
+ be necessary for the micro-gc to scan the C stack at all (there's
+ never any part of it above the current frame, i.e. above the innermost
+ mega_apply, to scan).
+
+ Note that the above works under the assumption that new things are
+ only referenced from the stacks in or below the current frame. That's
+ not always true - code might change the stack further back to
+ reference new things, e.g. if a function allocates some temporary
+ struct on the stack and then passes the pointer to it to subroutines
+ that change it.
+
+ Such code on the C level is very unlikely, since it would mean that C
+ code would be changing something on the C stack back across a pike
+ level apply.
+
+ On the Pike level it can occur with inner functions changing variables
+ in their surrounding functions. Those cases can however be detected
+ and handled one way or the other. One way is to detect them at compile
+ time and "stay" in the frame of the outermost surrounding function for
+ the purposes of the micro-gc. That doesn't scale well if the inner
+ functions are deeply recursive, though.
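+
+ A minimal sketch of the per-frame ZCT bookkeeping described above.
+ The names are hypothetical, and the stack scan and the destruct/free
+ primitive are assumed as external helpers:
+
+   #include <stddef.h>
+
+   struct thing { struct thing *zct_next; /* ...payload... */ };
+
+   /* Each pike stack frame gets a ZCT of its own: things that ran
+    * out of heap refs while this frame was the topmost one. */
+   struct pike_frame {
+     struct pike_frame *prev;
+     struct thing *zct;
+   };
+
+   static _Thread_local struct pike_frame *top_frame;
+
+   extern int referenced_from_stack_top(struct thing *t); /* assumed */
+   extern void destruct_and_free(struct thing *t);        /* assumed */
+
+   /* A heap refcount reached zero: defer to the topmost frame's ZCT. */
+   void zct_add(struct thing *t)
+   {
+     t->zct_next = top_frame->zct;
+     top_frame->zct = t;
+   }
+
+   /* Micro-gc run: only the stack parts above the topmost pike frame
+    * need scanning, and only the topmost ZCT is considered. */
+   void micro_gc(void)
+   {
+     struct thing *keep = NULL, *next;
+     for (struct thing *t = top_frame->zct; t; t = next) {
+       next = t->zct_next;
+       if (referenced_from_stack_top(t)) {
+         t->zct_next = keep;        /* still in use: keep deferred */
+         keep = t;
+       } else
+         destruct_and_free(t);      /* timely destruct, as before */
+     }
+     top_frame->zct = keep;
+   }
+
+   /* When a frame is popped, nothing above it can reference its ZCT
+    * entries any more, so they are freed without any scanning. */
+   void pop_frame(void)
+   {
+     struct thing *next;
+     for (struct thing *t = top_frame->zct; t; t = next) {
+       next = t->zct_next;
+       destruct_and_free(t);
+     }
+     top_frame = top_frame->prev;
+   }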
+
+ This micro-gc approach comes at a considerable expense compared to the
+ solution described in the issue "Garbage collector": Not only does the
+ generational gc with mark-and-sweep for young data disappear (which
+ according to the research paper gives 15-40% more total throughput),
+ but the delayed updating of the refcounts disappears to a large extent
+ too. Refcounting from the stacks is still avoided though, and delayed
+ updating of refcounts in shared data is still done, which is crucial
+ for multi-cpu performance.
+
+
+ Issue: Single-refcount optimizations
+
+ Pre-multi-cpu Pike makes use of the refcounting to optimize
+ operations: Some operations that shouldn't be destructive on their
+ operands can be destructive anyway on an operand if it has no other
+ references. A common case is adding elements to arrays:
+
+   array arr = ({});
+   while (...)
+     arr += ({another_element});
+
+ Here arr has only a single reference from the stack, so the +=
+ operator destructively grows the array to add new elements to the end
+ of it.
+
+ With the new gc approach, such single-refcount optimizations no longer
+ work in general. This is the case even if the micro-gc is implemented,
+ since stack refs aren't counted.
+
+ FIXME: List cases and discuss solutions.
+
+
  Issue: Moving things between lock spaces

  Things can be moved between lock spaces, or be made thread local or
  disowned. In all these cases, one or more things are given explicitly.
  It's natural if not only those things are moved, but also all other
  things in the same source lock space that are referenced from the
  given things and not from anywhere else (this operation is the same
  one that Pike.count_memory does). In the case of making things thread
  local or disowned, it is also necessary to check that the explicitly
  given things aren't referenced from elsewhere (a simplified sketch of
  this check follows below).
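+
+ A sketch of how the "not referenced from elsewhere" check could be
+ done. This is simplified and not the actual implementation: the
+ thing layout and the visit_refs traversal hook are assumptions for
+ the example, and the real code would presumably share machinery with
+ Pike.count_memory:
+
+   struct thing { int refs; int internal_refs; int marked; };
+
+   /* Assumed gc hook: calls cb for every thing that t references. */
+   extern void visit_refs(struct thing *t,
+                          void (*cb)(struct thing *ref, void *data),
+                          void *data);
+
+   static void count_cb(struct thing *ref, void *data)
+   {
+     (void) data;
+     if (ref->marked) ref->internal_refs++;
+   }
+
+   /* things[]: the candidate set, already collected by a traversal
+    * from the explicitly given things through the source lock space. */
+   int only_internally_referenced(struct thing **things, size_t n)
+   {
+     for (size_t i = 0; i < n; i++) {     /* mark the candidate set */
+       things[i]->marked = 1;
+       things[i]->internal_refs = 0;
+     }
+     for (size_t i = 0; i < n; i++)       /* count refs inside the set */
+       visit_refs(things[i], count_cb, NULL);
+
+     /* A thing is referenced from elsewhere iff some of its refs
+      * aren't accounted for internally. */
+     for (size_t i = 0; i < n; i++)
+       if (things[i]->refs != things[i]->internal_refs)
+         return 0;
+     return 1;
+   }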

pike.git/multi-cpu.txt:507:
  Issue: Types

  Like strings, types are globally unique and always shared in Pike.
  That means lock-free access to them is desirable, and it should also
  be doable fairly easily since they are constant. Otoh it's probably
  not as vital as for strings since types typically only are built
  during compilation.


- Issue: Shared mapping and multiset data blocks
+ Issue: Mapping and multiset data blocks

- An interesting issue is if things like mapping/multiset data blocks
- should be first or second class things (c.f. issue "Memory object
- structure"). If they're second class it means copy-on-write behavior
- doesn't work across lock spaces. If they're first class it means
- additional overhead handling the lock spaces of the mapping data
- blocks, and if a mapping data is shared between lock spaces then it
- has to be in some third lock space of its own, or in the global lock
- space, neither of which would be very good.
-
+ Mappings and multisets currently have a deferred copy-on-write
+ behavior, i.e. several mappings/multisets can share the same data
+ block and it's only copied to a local one when changed through a
+ specific mapping/multiset.
+
+ If mappings and/or multisets are changed to be lock-free then the
+ copy-on-write behavior needs to be solved:
+
+ o A flag is added to the mapping/multiset data block that is set
+   whenever it is shared.
+ o Every destructive operation checks the flag. If set, it makes a
+   copy, otherwise it changes the original block. Thus the flag is
+   essentially a read-only marker.
+ o The flag is cleared by the gc if it finds only one ref to a data
+   block. (Refcounting cannot be used without locking.)
+ o Hazard pointers are necessary for every destructive access,
+   including the setting of the flag. The reason is that the
+   read-onlyness only is in effect after all currently modifying
+   threads are finished with the block. The thread that is setting the
+   flag therefore has to wait until there are no other hazard pointers
+   to the block before returning.
+
+ It's a good question whether keeping the copy-on-write feature is
+ worth this overhead. Of course, an alternative is to simply let the
+ builtin mappings and/or multisets be locking, and instead have special
+ objects that implement lock-free data types.
+
+ Another issue is whether things like mapping/multiset data blocks
+ should be first or second class things (c.f. issue "Memory object
+ structure"). If they're second class it means copy-on-write behavior
+ doesn't work across lock spaces. If they're first class it means
+ additional overhead handling the lock spaces of the mapping data
+ blocks, and if a mapping data block is shared between lock spaces then
+ it has to be in some third lock space of its own, or in the global
+ lock space, neither of which would be very good.
+
  So it doesn't look like there's a better way than to botch
  copy-on-write in this case.


  Issue: Emulating the interpreter lock

  For compatibility with old C modules, and for the _disable_threads
  function, it is necessary to retain a complete lock like the current
  interpreter lock. It has to lock the global area for writing, and
  also stop all access to all lock spaces, since the thread local data

pike.git/multi-cpu.txt:573:
  Issue: Exceptions

  "Forgotten" locks after exceptions shouldn't be a problem: Explicit
  locks are handled just like today (i.e. it's up to the pike
  programmer), and implicit locks can safely be released when an
  exception is thrown.

  One case requires attention: An old-style function that requires the
  compat interpreter lock might catch an error. In that case the error
- system has to ensure that lock is reacquired.
+ system has to ensure that the lock is reacquired. This is however
+ only a problem if C level module compatibility is kept as an option,
+ which currently appears to be unlikely with the proposed gc (see
+ issue "Garbage collector", item c).


  Issue: C module interface

  A new add_function variant is probably added for new-style functions.
  It takes bits for the flags discussed for issue "Function calls".
  New-style functions can only assume free access to the current storage
  according to those flags; everything else must be locked (through a
  new set of macros/functions).
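+
+ As a purely illustrative sketch of what a new-style function might
+ look like: add_function_flags and the FN_* bits below are
+ hypothetical placeholders for the yet undesigned interface, and
+ set_ptr is the pointer write wrapper discussed further down.
+
+   static void f_set_title(INT32 args)
+   {
+     /* Free access to the current storage is granted by the flags
+      * given at registration; anything else would have to be locked
+      * through the new macros. */
+
+     /* Direct pointer updates go through the write wrapper so the gc
+      * can log the old pointer state (see issue "Garbage collector",
+      * item c) - and note: no refcount twiddling for the stack ref. */
+     set_ptr(THIS, title, Pike_sp[-args].u.string);
+
+     pop_n_elems(args);
+     push_int(0);
+   }
+
+   PIKE_MODULE_INIT
+   {
+     /* Hypothetical new add_function variant; the flag bits are the
+      * ones discussed in issue "Function calls". */
+     add_function_flags("set_title", f_set_title,
+                        "function(string:void)", FN_WRITES_STORAGE);
+   }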

pike.git/multi-cpu.txt:595:   mapping_lookup, and object_index_no_free) handles the necessary
  locking internally. They will only assume that the thing is safe, i.e.
  that the caller ensures the current thread controls at least one ref.

  THREADS_ALLOW/THREADS_DISALLOW and their likes are not used in
  new-style functions.

  There will be new GC callbacks for walking module global pointers to
  things (see issue "Garbage collection and external references").

+ The proposed gc requires that every pointer change in a (heap
+ allocated) thing is tracked (for pointers that might point to other
+ heap allocated things). This is because the gc has to log the old
+ state of the pointers before the first change after a gc run (see
+ issue "Garbage collector", item c). For all builtin data types, this
+ is handled internally in primitives like mapping_insert and
+ object_set_index, so the only cases that the C module code typically
+ has to handle are direct updates in the current storage. Therefore all
+ pointer changes that currently look something like
+
+   THIS->my_thing = some_thing;
+
+ must be wrapped in some kind of macro/function call to become:
+
+   set_ptr (THIS, my_thing, some_thing);
+
+ On the positive side, all the refcount twiddling to account for
+ references from the C and pike stacks can be removed from the C code.
+ That also includes a lot of the SET_ONERROR stuff which currently is
+ necessary to avoid lost refs when errors are thrown.
+
+
  Issue: C module compatibility

-
+ Currently it doesn't look like the goal of keeping a source-level
+ compatibility mode for C modules can be achieved. The problem is that
+ every pointer assignment in every heap allocated thing must be wrapped
+ inside a macro/function call to make the new gc work (see issue
+ "Garbage collector", item c), and lots of C module code changes such
+ pointers directly through plain assignments.

  Ref issue "Emulating the interpreter lock".

  Ref issue "Garbage collection and external references".


  Issue: Garbage collection and external references

  The current gc design is that there is an initial "check" pass that
  determines external references by counting all internal references,
  and then for each thing subtract it from its refcount. If the result

pike.git/multi-cpu.txt:631:   callbacks that keep track of them. The gc instead has to scan the C
  stacks for the threads and treat any aligned machine word containing
  an apparently valid pointer to a gc candidate thing as an external
  reference. This is the common approach used by standalone gc libraries
  that don't require application support. For reference, here is one
  such garbage collector, written in C++:
  http://developer.apple.com/DOCUMENTATION/Cocoa/Conceptual/GarbageCollection/Introduction.html#//apple_ref/doc/uid/TP40002427
  Its source is here:
  http://www.opensource.apple.com/darwinsource/10.5.5/autozone-77.1/

- The same approach is also necessary to cope with old C modules (see
- issue "C module compatibility"), but since global C level pointers are
- few, it might not be mandatory to get this working.
+ The same approach would also be necessary to cope with old C modules
+ (see issue "C module compatibility"), but since global C level
+ pointers are few, it might not be mandatory to get this working. And
+ besides, it appears unlikely that compatibility with old C modules can
+ be kept.


  Issue: Global pike level caches

-
+ Global caches that are shared between threads are common, and in
+ almost all cases such caches are implemented using mappings. There's
+ therefore a need for (at least) a hash table data type that handles
+ concurrent access and high mutation rates very efficiently.
+
+ Issue "Lock-free hash table" discusses such a solution. It's currently
+ not clear whether the builtin mappings will be lock-free or not (c.f.
+ the copy-on-write problem in issue "Mapping and multiset data
+ blocks"), but if they're not then a mapping-like object class is
+ implemented that is lock-free. It's easy to replace global cache
+ mappings with such objects.
+
+
  Issue: Thread.Queue

  A lock-free implementation should be used. The things in the queue are
  typically disowned to allow them to become thread local in the reading
  thread.


-
+ Issue: "Relying on the interpreter lock"
+
+
  Issue: False sharing

  False sharing occurs when thread local things used frequently by
  different threads are next to each other so that they share the same
  cache line. Thus the cpu caches might force frequent resynchronization
  of the cache line even though there is no apparent hotspot problem on
  the C level.

  This can be a problem in particular for all the block_alloc pools
  containing small structs. Using thread local pools is seldom a

pike.git/multi-cpu.txt:779:   shouldn't be a problem. The java implementation
  (http://sourceforge.net/projects/high-scale-lib) is Public Domain. In
  the comments there is talk about efforts to make a C version.

  It supports (through putIfAbsent) the uniqueness requirement for
  strings, i.e. if several threads try to add the same string (at
  different addresses) then all will end up with the same string pointer
  afterwards.

  The java implementation relies on the gc to free up the old hash
- tables after resize. We don't have that convenience, but the problem
- is still solvable; see issue "Hazard pointers".
+ tables after resize. The proposed gc (issue "Garbage collector") would
+ solve it for us too, but even without that the problem is still
+ solvable - see issue "Hazard pointers".


  Issue: Hazard pointers

  A problem with most lock-free algorithms is how to know no other
  thread is accessing a block that is about to be freed. Another is the
  ABA problem which can occur when a block is freed and immediately
  allocated again (common for block_alloc).

  Hazard pointers are a good way to solve these problems without leaving

pike.git/multi-cpu.txt:852:
  Some low-level primitives, such as CAS and fences, are necessary to
  build the various lock-free tools. A third-party library would be
  useful.

  An effort to make a standardized library is here:
  http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2047.html (C
  level interface at the end). It apparently lacks implementation,
  though.

+ The linux kernel is reported to contain a good abstraction lib for
+ these primitives, along with implementations for a large set of
+ architectures (see
+ http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2006.08.21a.pdf).
+ Can we use it? (Check GPL contamination.)
+
  Required operations:

  CAS(address, old_value, new_value)
    Compare-and-set: Atomically sets *address to new_value iff its
    current value is old_value. Needed for 32-bit variables, and on
    64-bit systems also for 64-bit variables.

  ATOMIC_INC(address)
  ATOMIC_DEC(address)
    Increments/decrements *address atomically. Can be simulated with

pike.git/multi-cpu.txt:897:   FIXME: More..

  Survey of platform support:

  o Windows/Visual Studio: Got "Interlocked Variable Access":
    http://msdn.microsoft.com/en-us/library/ms684122.aspx

  o FIXME: More..


+ Issue: Preemptive thread suspension
+
+ The proposed gc should preferably be able to suspend other threads
+ preemptively (see issue "Garbage collector", item g). Survey of
+ platform support for this:
+
+ o POSIX threads: No support. Deprecated and removed from the standard
+   since it can very easily lead to deadlocks. On some systems there
+   might still be a pthread_suspend function.
+
+ o Windows: SuspendThread and ResumeThread exist but are only
+   intended for use by debuggers.
+
+ It's clear that a nonpreemptive fallback is required.
+
+ Regardless of method, it's vital that the gc thread does not hold any
+ mutex, and that it takes care to avoid being stopped while it suspends
+ another thread. This is even more important if a preemptive method is
+ used.
+
+
  Issue: OpenMP

  OpenMP (see www.openmp.org) is a system to parallelize code using
  pragmas that are inserted into the code blocks. It can be used to
  easily parallelize otherwise serial internal algorithms like searching
  and all sorts of loops over arrays etc. Thus it addresses a different
  problem than the high-level parallelizing architecture above, but it
  might provide significant improvements nevertheless.

  It's therefore worthwhile to look into how this can be deployed in the

pike.git/multi-cpu.txt:929:
  FIXME: Survey platform-specific limitations.


  Various links

  Pragmatic nonblocking synchronization for real-time systems
    http://www.usenix.org/publications/library/proceedings/usenix01/full_papers/hohmuth/hohmuth_html/index.html
  DCAS is not a silver bullet for nonblocking algorithm design
    http://portal.acm.org/citation.cfm?id=1007945
+ A simple and efficient memory model for weakly-ordered architectures
+   http://www.open-std.org/Jtc1/sc22/WG21/docs/papers/2007/n2237.pdf