Multi-cpu support in Pike 
------------------------- 
 
This is a draft spec for how to implement multi-cpu support in Pike. 
The intention is that it gets extended along the way as more issues
get ironed out. Discussions take place in "Pike dev" in LysKOM or
pike-devel@lists.lysator.liu.se. 
 
Initial draft created 8 Nov 2008 by Martin Stjernholm. 
 
 
Background and goals 
 
Pike supports multiple threads, but like many other high-level 
languages it only allows one thread at a time to access the data 
structures. This means that the utilization of multi-cpu and 
multi-core systems remains low, even though there are some modules 
that can do isolated computational tasks in parallel (e.g. the Image
module). 
 
It is the so-called "interpreter lock" that must be locked to access 
any reference variable (i.e. everything except floats and native 
integers). This lock is held by default in essentially all C code and 
is explicitly unlocked in a region by the THREADS_ALLOW/ 
THREADS_DISALLOW macros. On the pike level, the lock is always held - 
no pike variable can be accessed and no pike function can be called 
otherwise. 
 
The purpose of the multi-cpu support is to rectify this. The design 
goals are, in order of importance: 
 
1.  Pike threads should be able to execute pike code concurrently on 
    multiple cpus as long as they only modify thread local pike data 
    and read a shared pool of static data (i.e. the pike programs, 
    modules and constants). 
 
2.  There should be as few internal hot spots as possible (preferably 
    none) when pike code is executed concurrently. Care must be taken 
    to avoid internal synchronization, or updates of shared data that 
    would cause "cache line ping-pong" between cpus. 
 
3.  The concurrency should be transparent on the pike level. Pike code 
    should still be able to access shared data without locking and 
    without risking low-level inconsistencies. (So Thread.Mutex etc 
    would still be necessary to achieve higher level synchronization.) 
 
4.  There should be tools on the pike level to allow further 
    performance tuning, e.g. lock-free queues, concurrent access hash 
    tables, and the possibility to lock different regions of shared 
    data separately. These tools should be designed so that they are 
    easy to slot into existing code with few changes. 
 
5.  There should be tools to monitor and debug concurrency. It should 
    be possible to make assertions that certain objects aren't shared, 
    and that certain access patterns don't cause thread 
    synchronization. This is especially important if goal (3) is 
    realized, since the pike code by itself won't show what is shared 
    and what is thread local. 
 
6.  C modules should continue to work without source level 
    modification (but likely without allowing any kind of 
    concurrency). 
 
Note that even if goal (3) is accomplished, this is no miracle cure 
that would make all multithreaded pike programs run with optimal 
efficiency on multiple cpus. One could expect better concurrency in 
old code without adaptations, but it could still be hampered
considerably by e.g. frequent updates to shared data. Concurrency is a 
problem that must be taken into account on all levels. 
 
 
Other languages 
 
Perl: All data is thread local by default. Data can be explicitly 
shared, in which case Perl ensures internal consistency. Every shared 
variable is apparently locked individually. Referencing a thread local 
variable from a shared one causes the thread to die. See perlthrtut(1).
 
Python: Afaik it's the same state of affairs as Pike. 
 
 
Solution overview 
 
The basic approach is to divide all data into thread local and shared: 
 
o  Thread local data is everything that is accessible to one thread 
   only, i.e. there are no references to anything in it from shared 
   data or from any other thread. This is typically data that the 
current thread has created itself and only references from the
   stack. The thread can access its local data without locking. 
 
o  Shared data is everything that is accessible from more than one 
   thread. Access to it is synchronized using a global read/write 
   lock, the so-called "global lock". I.e. this lock can either be 
   locked for reading by many threads, or be locked by a single thread 
   for writing. Locking the global lock for writing is the same as 
   locking the interpreter lock in current pikes. (This single lock is 
   refined later - see issue "Lock spaces".) 
 
o  There is also a special case where data can be "disowned", i.e. not 
   shared and not local in any thread. This is used in e.g. 
   Thread.Queue for the objects that are in transit between threads. 
   Disowned data cannot have arbitrary references to it - it must 
   always be under the control of some object that in some way ensures 
   consistency. (Garbage could be made disowned since it by definition 
   no longer is accessible from anywhere, but of course it is always 
   better to clean it up instead.) 
 
+--------+           +---------------------+     Direct    +--------+ 
|        |<-- refs --| Thread 1 local data |<- - access - -|        | 
|        |           +---------------------+               | Thread | 
|        |                                                 |    1   | 
|        |<- - - - Access through global lock only  - - - -|        | 
| Shared |                                                 +--------+ 
|        | 
|  data  |           +---------------------+     Direct    +--------+ 
|        |<-- refs --| Thread 2 local data |<- - access - -|        | 
|        |           +---------------------+               | Thread | 
|        |                                                 |    2   | 
|        |<- - - - Access through global lock only  - - - -|        | 
|        |                                                 +--------+ 
+--------+                 ... etc ... 
 
The principal use case for this model is that threads can do most of 
their work with local data and read access to the shared data, and 
comparatively seldom require the global write lock to update the 
shared data. Individual shared things do not have locks of their own,
since that would cause excessive lock overhead.
 
Note that the shared data is typically the same as the data referenced 
from the common environment (i.e. the "global data"). 
 
Also note that the current object (this) always is shared in pike 
modules, so a thread cannot assume free access to it. In other pike 
classes it would often be shared too, but it is still important to 
utilize the situation when it is thread local. See issue "Function 
calls". 
 
A thread local thing, and all the things it references directly or 
indirectly, automatically becomes shared whenever it gets referenced 
from a shared thing. 
 
A shared thing never automatically becomes thread local, but there is 
a function to explicitly "take" it. It would first have to make sure 
there are no references to it from shared or other thread local things 
(c.f. issue "Moving things between lock spaces"). Thread.Queue has a 
special case so that if a thread local thing with no other refs is 
enqueued, it is disowned by the current thread, and later becomes 
thread local in the thread that dequeues it. 
 
 
Issue: Lock spaces 
 
Having a single global read/write lock for all shared data could 
become a bottleneck. Thus there is a need for shared data with locks 
separate from the global lock. Things that share a common lock are
called a "lock space", and it is always possible to look up the lock 
that governs any given thing (see issue "Memory object structure"). 
 
A special global lock space, which corresponds to the shared data 
discussed above, is created on startup. All others have to be created 
explicitly. 
 
The intended use case for lock spaces is a "moderately large" 
collection of things: Too large and threads end up locking each other
out, too small and the lock overhead (both execution- and memory-wise)
gets prohibitive. A typical lock space could be a RAM cache consisting
of a
mapping and all its content. 
 
Many different varieties of lock space locks can be considered, e.g. a 
simple exclusive access mutex lock or a read/write lock, priority 
locks, locks that ensure fairness, etc. Therefore different (C-level) 
implementations should be allowed. 
 
One important characteristic of lock space locks is whether they are 
implicit or explicit: 
 
Implicit locks are locked internally, without intervention on the pike 
level. The lock duration is unspecified; locks are only acquired to 
ensure internal consistency. All low level data access functions check 
whether the lock space for the accessed thing is locked already. If it 
isn't then the lock is acquired automatically. All implicit locks have 
a well defined lock order (by pointer comparison), and since they only 
are taken to guarantee internal consistency, an access function can 
always release a lock to ensure correct order (see also issue "Lock 
space locking"). 
 
Explicit locks are exposed to the pike level and must be locked in a 
similar way to Thread.Mutex. If a low level data access function 
encounters an explicit lock that isn't locked, it throws an error. 
Thus it is left to the pike programmer to avoid deadlocks, but the 
pike core won't cause any by itself. Since the pike core keeps track
of which lock governs which thing, it ensures that no lock violating
access occurs, which is a valuable aid in ensuring correctness.
 
One can also consider a variant with a read/write lock space lock that 
is implicit for read but explicit for write, thus combining atomic 
pike-level updates with the convenience of implicit locking for read 
access. 
 
The scope of a lock space lock is (at least) the state inside all the 
things it contains (with a couple of exceptions - see issue "Lock space
lock semantics"), but not the set of things itself, i.e. things might 
be added to a lock space without holding a write lock. Removing a 
thing from a lock space always requires the write lock on it since 
that is necessary to ensure that a lock actually governs a thing for 
as long as it is held (regardless of whether it's held for reading or
writing).
 
See also issues "Memory object structure" and "Lock space locking" for 
more details. 
 
 
Issue: Memory object structure 
 
Of concern are the memory objects known to the gc. They are called 
"things", to avoid confusion with "objects" which are the structs for 
pike objects. 
 
There are two types of things: 
 
o  First class things with gc header and lock space pointer. Most pike 
   visible types are first class things. The exceptions are ints and 
   floats, which are passed by value. 
 
o  Second class things contain only a gc header. They are similar to 
   first class except that their lock spaces are implicit from the 
   referencing things, which means all those referencing things must 
   always be in the same lock space. 
 
Thread local things could have NULL as lock space pointer, but as a 
debug measure they could also point to the thread object so that it's 
possible to detect bugs with a thread accessing things local to 
another thread. 
 
Before the multi-cpu architecture, there were global double-linked
lists for each referenced pike type: array, mapping, multiset, object, 
and program (strings and types are handled differently). Thanks to the 
new gc, the double-linked lists aren't needed at all anymore. 
 
            +----------+                      +----------+ 
            | Thread 1 |                      | Thread 2 | 
           .+----------+.                    .+----------+. 
          :   refs   O   :                  :   O   O      : 
     ,----- O <--> O      :          ,------- O         O ------. 
     |    :     O     O -----.       |      :      O  O    :    | 
     |     :............:    |       |       :............:     | 
 ref |                       | ref   | ref                      | ref 
     |                       |       |                          | 
    .|..............       ..v.......v.....  refs ..............|. 
   : |     refs     : ref :  O       O   O <------> O       O   v : 
  :  v  O <---> O ------------> O      O    :   :      O        O  : 
   : O    O    O  O :     : O       O   O  :     : O       O  O   : 
    +--------------+       +--------------+       +--------------+ 
    | Lock space 1 |       | Lock space 2 |       | Lock space 3 | 
    +--------------+       +--------------+       +--------------+ 
 
This figure tries to show some threads and lock spaces, and their 
associated things as O's inside the dotted areas. Some examples of 
possible references between things are included: Thread local things 
can only reference things belonging to the same thread or things in 
any lock space, while things in lock spaces can reference things in 
the same or other lock spaces. There can be cyclic structures that 
span lock spaces. 
 
The lock space lock structs are tracked by the gc just like anything 
else, and they are therefore garbage collected when they become empty 
and unreferenced. The gc won't free a lock space lock struct that is 
locked since it always has at least one reference from the array of
locked locks that each thread maintains (c.f. issue "Lock space 
locking"). 
 
 
Issue: Lock space lock semantics 
 
There are three types of locks: 
 
o  A read-safe lock ensures only that the data is consistent, not that 
   it stays constant. This allows lock-free updates in things where 
   possible (which could include arrays, mappings, and maybe even 
   multisets and objects of selected classes). 
 
o  A read-constant lock ensures both consistency and constantness 
   (i.e. what usually is assumed for a read-only lock). 
 
o  A write lock ensures complete exclusive access. The owning thread 
   can modify the data, and it can assume no other changes occur to it 
   (barring refcounters and lock space pointers - see below), although 
   that assumption has to be "weak" since there are a few situations 
   when another thread can intervene - see issue "Emulating the 
   interpreter lock". 
 
   The owning thread can also, for a limited time, leave the data in
   an inconsistent state. This is however still limited by the calls to
   check_threads(), which means that the state must be consistent 
   again every time the evaluator callbacks are run. The reason is the 
   same one as above. 
 
Allowing lock-free updates is attractive, so the standard read/write 
lock that governs the global lock space will probably be multiple 
read-safe/single write. 
 
The lock space lock covers all the data in the thing, with two 
exceptions: 
 
o  The refcounter (and other gc-related flags and fields) can always 
   change concurrently since the gc runs in a thread of its own, and 
   it doesn't heed any locks - see issue "Garbage collector". 
 
   A ref to a thing can always be added or removed, even if another 
   thread holds an exclusive write lock on it. That is because the thing
   will only be freed by the gc, which won't free it if a ref is 
   added. 
 
   Refcount updates need to be atomic if the refcounts are to be used 
   at all from other threads. Even so, they can only be used 
   opportunistically since they (almost) always might change 
   asynchronously. That could still be good enough for e.g. 
   Pike.count_memory (no one could expect it to be accurate anyway if
   another thread is modifying the data structure being measured). 
 
o  The lock space pointer itself must at all times be either NULL or 
   point to a valid lock space struct, since another thread needs to
   access it to tell whether access to the thing is permissible. A 
   write lock is required to change the lock space pointer, but even 
   so the update must be atomic. 
 
   Since the lock space lock structs are collected by the gc, there is 
   no risk for races when threads asynchronously dereference lock 
   space pointers. 
 
FIXME: What about concurrent gc access to follow pointers? 
 
 
Issue: Lock space locking 
 
This is the locking procedure to access a thing: 
 
1.  Read the lock space pointer. If it's NULL then the thing is thread 
    local and nothing more needs to be done. 
2.  Address an array containing the pointers to the lock spaces that 
    are already locked by the thread. 
3.  Search for the lock space pointer in the array. If present then 
    nothing more needs to be done. 
4.  Lock the lock space lock as appropriate. Note that this can imply
    that other implicit locks being held are released to ensure correct
    lock order (see issue "Lock spaces"). Then it's added to the
    array.
 
A thread typically won't hold more than a few locks at any time (less 
than ten or so), so a plain array and linear search should perform 
well. For quickest possible access the array should be a static thread 
local variable (c.f. issue "Thread local storage"). If the array gets 
full, implicit locks in it can be released automatically to make 
space. Still, a system where more arrays can be allocated and chained 
on would perhaps be prudent to avoid the theoretical possibility of 
running out of space for locked locks. 
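
The fast path of this procedure could look something like the
following C sketch. All names here are invented for illustration; the
real implementation would live in the interpreter core:

  #define MAX_HELD_LOCKS 16

  struct lock_space;                     /* opaque lock space struct */
  struct thing { struct lock_space *lock_space; /* ... more fields */ };

  /* Per-thread array of the lock spaces this thread currently holds
   * locked (c.f. issue "Thread local storage"). */
  static _Thread_local struct lock_space *held_locks[MAX_HELD_LOCKS];
  static _Thread_local int num_held_locks;

  /* Stub: would lock ls, possibly first releasing other implicit
   * locks to keep the global lock order. */
  static void lock_space_lock(struct lock_space *ls) { (void) ls; }

  static void lock_for_access(struct thing *t)
  {
    struct lock_space *ls = t->lock_space;
    if (!ls) return;                     /* 1: thread local thing */
    for (int i = 0; i < num_held_locks; i++)  /* 2 and 3 */
      if (held_locks[i] == ls) return;   /* already locked by us */
    lock_space_lock(ls);                 /* 4: lock ... */
    held_locks[num_held_locks++] = ls;   /* ... and record it
                                          * (overflow handling omitted) */
  }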
 
Since implicit locks can be released (almost) at will, they are open 
for performance tuning: Too long lock durations and they'll lock out
other threads, too short and the locking overhead becomes more
significant. As a starting point, it seems reasonable to release them 
at every evaluator callback call (i.e. at approximately every pike 
function call and return). 
 
 
Issue: Garbage collector 
 
Pike has used refcounting to collect noncyclic structures, combined 
with a stop-the-world periodical collector for cyclic structures. The 
periodic pauses are already a problem, and it only gets worse as the 
heap size and number of concurrent threads increase. Since the gc 
needs an overhaul anyway, it makes sense to replace it with a more 
modern solution. 
 
http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-get.cgi/2006/PHD/PHD-2006-10.ps 
is a recent thesis work that combines several state-of-the-art gc 
algorithms into an efficient whole. A brief overview of the highlights:
 
o  The reference counts aren't updated for references on the stack. 
   The stacks are scanned when the gc runs instead. This saves a great 
   deal of refcount updates, and it also simplifies C level 
   programming a lot. Only refcounts between things on the heap are 
   counted. 
 
o  The refcounts are only updated when the gc runs. This saves a lot 
   of the remaining updates since if a pointer starts with value p_0 
   and then changes to p_1, then p_2, p_3, ..., and lastly to p_n at 
   the next gc, then only p_0->refs needs to be decremented and 
   p_n->refs needs to be incremented - the changes in all the other 
   refcounts, for the things pointed to in between, cancel out. 
 
o  The above is accomplished by thread local logging, to make the old 
   p_0 value available to the gc at the next run. This means it scales 
   well with many cpus.
 
o  A generational gc uses refcounting only for old things in the heap. 
   New things, which are typically very short-lived, aren't refcounted 
   at all but instead gc'ed using a mark-and-sweep collector. This is 
   shown to be more efficient for short-lived data, and it handles 
   cyclic structures without any extra effort. 
 
o  By using refcounting on old data, the gc only needs to give
   attention to refcounts that gets down to zero. This means the heap 
   can scale to any size without affecting the gc run time, as opposed 
   to using a mark-and-sweep collector on the whole heap. Thus the gc 
   time scales only with the amount of _change_ in the heap. 
 
o  Cyclic structures in the old refcounted data are handled
   incrementally using the fact that a cyclic structure can only occur 
   when a refcounter is decremented to a value greater than zero. 
   Those things can therefore be tracked and cycle checked in the 
   background. The gc uses several different methods to weed out false 
   alarms before doing actual cycle checks. 
 
o  The gc runs entirely in its own thread. It only needs to stop the 
   working threads for a very short time to scan stacks etc, and they 
   can be stopped one at a time. 
 
Effects of using this in Pike: 
 
a.  References from the C or pike stacks don't need any handling at 
    all (see also issue "Garbage collection and external references"). 
 
b.  A significant complication in various lock-free algorithms is the 
    safe freeing of old blocks (see e.g. issue "Lock-free hash 
    table"). This gc would solve almost all such problems in a 
    convenient way. 
 
c.  Special code is used to update refs in the heap. Under certain
    circumstances, before changing a pointer inside a thing which can
    point to another thing, the state of all non-NULL pointers in it
    is copied to a thread local log.

    This is mostly problematic since it requires that every pointer
    assignment inside a thing is replaced with a macro or function
    call, which has a big impact on C code. See issue "C module
    interface", and the sketch following this list.
 
d.  A new log_pointer field is required per thing. If a state copy has 
    taken place as described above, it points to the log that contains 
    the original pointer state of the thing. 
 
    Data containers that can be of arbitrary size (i.e. arrays, 
    mappings and multisets) should be segmented into fixed-sized 
    chunks with one log_pointer each, so that the state copy doesn't 
    get arbitrarily large. 
 
e.  The double-linked lists aren't needed. Hence two fewer pointers
    per thing.
 
f.  The refcounter word is changed to hold both normal refcount, weak 
    count, and flags. Overflowed counts are stored in a separate hash 
    table. 
 
g.  The collector typically runs concurrently with the rest of the 
    program, but there are some situations when it has to synchronize 
    with them (aka handshake). In the research paper this is done by 
    letting the gc thread suspend and resume the other threads (one at 
    a time). Since preemptive suspend and resume operations are 
    generally unsupported in thread libraries (c.f. issue "Preemptive 
    thread suspension"), a cooperative approach is necessary: 
 
    The gc thread sets a state flag that all other threads need a 
    handshake. Threads that are running do the handshake work 
    themselves before waiting on a mutex or in the next evaluator 
    callback call, and the gc thread handles the threads that are 
    currently waiting (ensuring that they don't start in the 
    meantime). 
 
    The work that needs to be done during a handshake is to set some 
    flags and record some local thread state for use by the gc thread. 
    This can be done concurrently in several threads, so no locking is 
    necessary. 
 
    Due to this interaction with the other threads, it's vital that 
    the gc thread does not hold any mutex, and that it takes care to 
    avoid being stopped (e.g. through an interrupt) while it works on
    behalf of another thread.
 
h.  All garbage, both noncyclic and cyclic, is discovered and handled
    by the gc thread. The other threads never free any block known to
    the gc.
 
i.  An effect of the above is that all garbage is discovered by a 
    separate collector thread which doesn't execute any other pike 
    code. This opens up the issue on how to call destruct functions. 
 
    At least thread local things should reasonably get their destruct 
    calls in that thread. A problem is however what to do when that 
    thread has exited or emigrated (see issue "Foreign thread 
    visits"). 
 
    For shared things it's not clear which thread should call destruct 
    anyway, so in that case any thread could do it. It might however 
    be a good idea to not do it directly in the gc thread, since doing 
    so would require that thread too to be a proper pike thread with 
    pike stack etc; it seems better to keep it an "invisible" 
    low-level thread outside the "worker" threads. In programs with a
    "backend thread" it could be useful to allow the gc thread to wake
    up the backend thread to let it execute the destruct calls.
 
j.  The most bothersome problem is that things are no longer freed 
    right away when running out of refs. See issue "Immediate 
    destruct/free when refcount reaches zero". 
 
k.  Weak refs are handled with a separate refcount in each thing. That 
    means things have two refcounts: One for weak refs and another for 
    all refs. See also issue "Weak ref garbage collection". 
 
l.  One might consider separating the refcounts from the things by
    using a hash table. This makes sense when considering that only
    the collector thread is using the refcounts, thereby avoiding
    false sharing caused by refcounter updates (and other gc related
    flag updates) by that thread (c.f. issue "False sharing").
 
    All the hash table lookups would however incur a significant 
    overhead in the gc thread. A better alternative would be to use a 
    bitmap based on the possible allocation slots used by the malloc 
    implementation, but that would require very tight integration with 
    the malloc system. The bitmap could work with only two bits per 
    refcounter - research shows that most objects in a refcounted heap 
    have very few refs. Overflowing (a.k.a. "stuck") refcounters at 3 
    would then be stored in a hash table. 
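
To illustrate items (c) and (d), here is a hedged sketch of what the
pointer update barrier could look like on the C level. The names are
made up - designing the real macro set is part of issue "C module
interface":

  struct pointer_log;                 /* thread local log kept by the gc */

  struct thing_header {
    struct pointer_log *log_pointer;  /* item d: points to the log entry
                                       * holding this thing's original
                                       * pointer state, or NULL */
    /* refcount word, gc flags, ... */
  };

  /* Stub: the real function would copy the state of all non-NULL
   * pointers in the thing to the current thread's log and set
   * t->log_pointer accordingly (item c). */
  static void log_pointer_state(struct thing_header *t) { (void) t; }

  /* Every pointer assignment inside a heap thing goes through a
   * barrier like this instead of a plain assignment. Assumes the
   * thing embeds its header as the `hdr' field. */
  #define SET_PTR(thing, field, value) do {             \
      if (!(thing)->hdr.log_pointer)                    \
        log_pointer_state(&(thing)->hdr);               \
      (thing)->field = (value);                         \
    } while (0)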
 
To simplify memory handling, the gc should be used consistently on all 
heap structs, regardless of whether they are pike visible things or not.
An interesting question is whether the type info for every struct 
(more concretely, the address of some area where the gc can find the 
functions it needs to handle the struct) is carried in the struct 
itself (through a new pointer field), or if it continues to be carried 
in the context for every pointer to the struct (e.g. in the type field 
in svalues). 
 
Since the gc would be used for most internal structs as well, which 
are almost exclusively used via compile-time typed pointers, it would 
probably save significant heap space to retain the type in the pointer 
context. It does otoh complicate the gc - everywhere where the gc is 
fed a pointer to a thing, it must also be fed a type info pointer, and 
the gc must then keep track of this data tuple internally. 
 
 
Issue: Immediate destruct/free when refcount reaches zero 
 
When a thing in Pike runs out of references, it's destructed and freed 
almost immediately in the pre-multi-cpu implementation. This behavior 
in Pike is used implicitly in many places. The major (hopefully all) 
principal use cases of concern are: 
 
1.  It's popular to write code that releases a lock in a timely manner
    by just storing it in a local variable that gets freed when the
    function exits (either by normal return or by exception). E.g:
 
      void foo() { 
        Thread.MutexKey my_lock = my_mutex->lock(); 
        ... do some work ... 
        // my_lock falls out of scope here when the function exits 
        // (also if it's due to a thrown exception), so the lock is 
        // released right away. 
      } 
 
    There's also code that opens files and sockets etc, and expects 
    them to be automatically closed again through this method. (That 
    practice has been shown to be bug prone, though, so in the sources 
    at Roxen many of those places have been fixed over time.) 
 
2.  In some cases, structures are carefully kept acyclic to make them 
    get freed quickly, and there is no control over which party gets
    the "last reference". 
 
    One example is if a cache holds one ref to an entry, and there 
    might at the same time be one or more worker threads that hold 
    references to the same entry while they use it. In this case the 
    cache can be pruned safely by dropping the reference to the entry, 
    without destructing it. 
 
    A variant when the structure cannot be made acyclic is to make a 
    "wrapper object": It holds a reference to the cyclic structure, 
    and all other parties makes sure to hold a ref to the wrapper as 
    long as they got interest in any part of the data. When the 
    wrapper runs out of refs, it destructs the cyclic structure 
    explicitly. 
 
    These tricks have mostly been used to reduce the amount of cyclic 
    garbage that requires the stop-the-world gc to run more often, but 
    there are also occasions when the structure holds open fd's which 
    must be closed without delay (one such occasion is the connection 
    fd in the http protocol in the Roxen WebServer). 
 
3.  In some applications with extremely high data mutation rate, the 
    immediate freeing of acyclic structures is seen as a prerequisite 
    to keep bounds on memory consumption. 
 
4.  FIXME: Are there more? 
 
The proposed gc (c.f. issue "Garbage collector") does not retain the 
immediate destruct and free semantic - only the gc running in its own 
thread may free things. Although it would run much more often than the 
old gc (probably on the order of once a minute up to several times a 
second), it would still break this semantic. To discuss each use case 
above: 
 
1.  Locks, and in some cases also open fd's, cannot wait until the 
    next gc run. 
 
    Observing that mutex locks always are thread local things, almost 
    all these cases (exceptions are possibly fd objects that somehow 
    are shared anyway) can be solved by a modified gc approach - see 
    issue "Micro-gc". 
 
    Since the micro-gc approach appears to be expensive, it's worth 
    considering to actually ditch this behavior and solve the problem 
    on the pike level instead. The compiler can be used to detect many 
    of these cases by looking for assignments to local variables that 
    aren't accessed from anywhere (there is already such a warning, 
    but it has been tuned down just to allow this problematic idiom). 
 
    A new language construct would be necessary, to ensure that the 
    variable gets destructed both on normal function exit and when an 
    exception is thrown. It could look something like this: 
 
      void foo() { 
        destruct_on_exit (Thread.MutexKey my_lock = my_mutex->lock()) { 
          ... do some work which requires the lock ... 
        } 
      } 
 
    I.e. the destruct_on_exit clause ensures that the variable(s) in 
    the parentheses are destructed (regardless of the amount of refs) 
    if execution passes out of the block in any way. 
 
    Anyway, since implementing the micro-gc is a comparatively small 
    amount of extra work, the intention is to do that first, and then 
    later implement the full gc as an experimental mode so that 
    performance can be compared. 
 
2.  This is not a problem as long as the reason only is gc efficiency. 
    It's worth noting that tricks such as "wrapper objects" still have 
    some use since they lessen the load on the background cycle 
    detector. 
 
    It is however a problem if there are open fd's or similar things 
    in the structure. It doesn't look like this is feasible to solve 
    internally; such structures typically are shared data, and letting 
    different threads reference shared data without locking is 
    essential for multi-cpu performance. This is therefore a case that 
    is probably best to solve on the pike level instead, possibly 
    through pike-visible refcounting. These cases appear to be fairly 
    few, at least. 
 
3.  If the solution in the issue "Micro-gc" is implemented, this 
    problem hardly exists at all since thread local data is refcounted 
    and freed almost exactly the same way as before. 
 
    Otherwise, since the gc thread operates only on the new and changed
    data, and collects newly allocated data very efficiently, it would 
    keep up with a very high mutation rate. GC runs are scheduled to 
    run just often enough to keep the heap size within a set limit - 
    as long as the gc thread doesn't become saturated and runs 
    continuously, it offloads the refcounting and freeing overhead 
    from the worker threads completely. 
 
    If the data mutation rate is so high that the gc thread becomes 
    saturated, what would happen is that malloc calls would start to 
    block when the heap limit is reached. Research shows that a 
    periodic gc done right provides considerably more throughput than 
    pure refcounting, so the application would still run faster 
    including that blocking. 
 
    The remaining concern is then that the blocking would introduce 
    uneven response times - the worker threads would go very fast most 
    of the time but every once in a while they could hang waiting on 
    the gc thread. These hangs are (according to the research paper) 
    on the order of milliseconds, but if they still are problematic 
    then a crude solution would be to introduce artificial short 
    sleeps in the working threads to bring down the mutation rate - 
    even with those sleeps the application would probably still be 
    significantly faster than the current approach. 
 
 
Issue: Micro-gc 
 
A way to retain the immediate-destruct (and free) semantic for thread 
local things referenced only from the pike stack is to implement a 
"micro-gc" that runs very quickly and is called often enough to keep 
the semantic. 
 
To begin with, the mark-and-sweep gc for new data (as discussed in the 
issue "Garbage collector") is not implemented, and the refcounts for 
thread local things are not delay-updated at all. The work of the 
micro-gc then becomes to free all things in the zero-count table (ZCT) 
that aren't referenced from the thread's C and pike stacks. 
 
Scanning the two stacks completely in every micro-gc would be too 
expensive. That is solved by partitioning the ZCT so that every pike 
stack frame gets one of its own. New zero-count things are always put 
in the ZCT for the current topmost frame.
 
That way, the micro-gc can scan the topmost parts of the stacks (above 
the last pike stack frame) for references to things in the topmost 
ZCT, and when a pike stack frame is popped then the things in its ZCT 
can be freed without scanning at all. This is enough to timely 
destruct and free the things put on the pike stack. 
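
A hedged C sketch of the per-frame ZCT bookkeeping described above,
with invented names and with sizing and overflow handling glossed
over:

  struct thing;
  /* Stub: would run the destruct function and free the thing. */
  static void destruct_and_free(struct thing *t) { (void) t; }

  /* One zero-count table per pike stack frame; new zero-count things
   * always go into the topmost frame's table. */
  struct zct {
    struct thing *entries[256];
    int num_entries;
  };

  /* Called when a thing's refcount drops to zero: record it instead
   * of freeing it right away. */
  static void zct_add(struct zct *top, struct thing *t)
  {
    top->entries[top->num_entries++] = t;  /* overflow handling omitted */
  }

  /* Called when a pike stack frame is popped: nothing above the frame
   * can reference its ZCT entries anymore, so they can be freed
   * without any stack scanning. (Things that gained a ref from
   * elsewhere are assumed to have been removed from the table by an
   * earlier micro-gc.) */
  static void zct_pop_frame(struct zct *frame)
  {
    for (int i = 0; i < frame->num_entries; i++)
      destruct_and_free(frame->entries[i]);
    frame->num_entries = 0;
  }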
 
Furthermore, since the old immediate-destruct semantics only requires 
destructing before and after every pike level function call, it won't 
be necessary for the micro-gc to scan the C stack at all (there's 
never any part of it above the current frame, i.e. above the innermost 
mega_apply, to scan). 
 
Note that the above works under the assumption that new things are 
only referenced from the stacks in or below the current frame. That's 
not always true - code might change the stack further back to 
reference new things, e.g. if a function allocates some temporary 
struct on the stack and then passes the pointer to it to subroutines
that change it. 
 
Such code on the C level is very unlikely, since it would mean that C 
code would be changing something on the C stack back across a pike 
level apply. 
 
On the Pike level it can occur with inner functions changing variables 
in their surrounding functions. Those cases can however be detected 
and handled one way or the other. One way is to detect them at compile
time and "stay" in the frame of the outermost surrounding function for 
the purposes of the micro-gc. That doesn't scale well if the inner 
functions are deeply recursive, though. 
 
This micro-gc approach comes at a considerable expense compared to the 
solution described in the issue "Garbage collector": Not only does the 
generational gc with mark-and-sweep for young data disappear (which 
according to the research paper gives 15-40% more total throughput), 
but the delayed updating of the refcounts disappear to a large extent 
too. Refcounting from the stacks is still avoided though, and delayed 
updating of refcounts in shared data is still done, which is crucial 
for multi-cpu performance. 
 
 
Issue: Single-refcount optimizations 
 
Pre-multi-cpu Pike makes use of the refcounting to optimize 
operations: Some operations that shouldn't be destructive on their 
operands can be destructive anyway on an operand if it has no other 
references. A common case is adding elements to arrays:
 
  array arr = ({}); 
  while (...) 
    arr += ({another_element}); 
 
Here arr only has a single reference from the stack, so the +=
operator destructively grows the array to add new elements to the end 
of it. 
 
With the new gc approach, such single-refcount optimizations no longer 
work in general. This is the case even if the micro-gc is implemented, 
since stack refs aren't counted. 
 
FIXME: List cases and discuss solutions. 
 
 
Issue: Weak ref garbage collection 
 
When the two refcounters (one for total number of refs and another for 
the number of weak refs) are equal then the thing is semantically 
freed. The problem is that it still has refs which might be followed
later, so the gc cannot free it. 
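
With the two counters (see issue "Garbage collector", item k),
detecting this state is simple. A sketch, with an assumed field
layout:

  struct thing {
    unsigned int refs;        /* total number of refs */
    unsigned int weak_refs;   /* how many of those are weak */
    /* ... */
  };

  /* Semantically freed: every remaining ref is weak. */
  static int only_weak_refs_left(struct thing *t)
  {
    return t->refs != 0 && t->refs == t->weak_refs;
  }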
 
There are two ways to tackle this problem: 
 
One alternative is to keep track of all the weak pointers that point 
to each thing, so that they can be followed backwards and cleared when 
only weak pointers are left. That tracking requires additional data 
structures and the associated overhead, and clearing the other 
pointers might require lock space locks to be taken. 
 
Another alternative is to free all refs emanating from the thing with 
only weak pointers left, and keep it as an empty structure (a 
destructed object, an empty array/multiset/mapping, or an empty 
skeleton program which contains no identifiers). This approach 
requires a flag to recognize such semi-freed things, and that all code 
that dereference weak pointers check for it. A problem is that data 
blocks remain allocated longer than necessary, maybe even 
indefinitely. That can be mitigated to some degree by shortening them 
using realloc(3). 
 
 
Issue: Moving things between lock spaces 
 
Things can be moved between lock spaces, or be made thread local or 
disowned. In all these cases, one or more things are given explicitly. 
It's natural if not only those things are moved, but also all other 
things in the same source lock space that are referenced from the 
given things and not from anywhere else (this is the same operation
as Pike.count_memory performs). In the case of making things thread
local or
disowned, it is also necessary to check that the explicitly given 
things aren't referenced from elsewhere. 
 
FIXME: This is a problem with the proposed garbage collector (see 
issue "Garbage collector"). Old things have refcounts that can be used,
but they might be stale, and the logging doesn't provide information
in the form we need. New things are even worse since they have no
refcounts at all that can be used to check for outside refs.
Furthermore, there is a race since an external ref can be added at any 
time from any thread. 
 
All this is settled when the gc is run: If the "controlled" refs are 
temporarily ignored then the set to move is the one that would turn 
into garbage. But it is not good to either have to wait for the gc or 
run it synchronously. 
 
Also, the problem above applies to Pike.count_memory too. 
 
 
Issue: Strings 
 
Strings are unique in Pike. This property is hard to keep if threads 
have local string pools, since a thread local string might become 
shared at any moment, and thus would need to be moved. Therefore the 
string hash table remains global, and lock congestion is avoided with 
some concurrent access hash table implementation. See issue "Lock-free 
hash table". 
 
Lock-free is a good start, but the hash function must also provide a 
good even distribution to avoid hotspots. Pike currently uses an 
in-house algorithm (DO_HASHMEM in pike_memory.h). Replacing it with a 
more widespread and better studied alternative should be considered. 
There seem to be few that are below O(n) (which DO_HASHMEM is),
though. 
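
One widespread and well studied candidate is FNV-1a, sketched below
purely as an example - note that it reads every byte and thus is
O(n), unlike the sampling DO_HASHMEM:

  #include <stdint.h>
  #include <stddef.h>

  /* 64-bit FNV-1a. Simple and well distributed, but O(n) in the
   * string length, whereas DO_HASHMEM samples long strings. */
  static uint64_t fnv1a_64(const unsigned char *data, size_t len)
  {
    uint64_t h = 14695981039346656037ULL;  /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
      h ^= data[i];
      h *= 1099511628211ULL;               /* FNV prime */
    }
    return h;
  }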
 
 
Issue: Types 
 
Like strings, types are globally unique and always shared in Pike. 
That means lock-free access to them is desirable, and it should also 
be doable fairly easily since they are constant. Otoh it's probably 
not as vital as for strings since types typically only are built 
during compilation. 
 
 
Issue: Mapping and multiset data blocks 
 
Mappings and multisets currently have a deferred copy-on-write 
behavior, i.e. several mappings/multisets can share the same data 
block and it's only copied to a local one when changed through a 
specific mapping/multiset. 
 
If mappings and/or multisets are changed to be lock-free then the
copy-on-write behavior needs to be reimplemented (a sketch follows
the list):
 
o  A flag is added to the mapping/multiset data block that is set 
   whenever it is shared. 
o  Every destructive operation checks the flag. If set, it makes a 
   copy, otherwise it changes the original block. Thus the flag is 
   essentially a read-only marker. 
o  In addition to the flag, the gc performs normal refcounting. It 
   clears the flag if the refcount is 1. (The refcount cannot be used 
   directly since it's delay-updated.) 
o  Hazard pointers are necessary for every destructive access, 
   including the setting of the flag. The reason is that the 
   read-onlyness only is in effect after all currently modifying 
   threads are finished with the block. The thread that is setting the 
   flag therefore has to wait until there are no other hazard pointers 
   to the block before returning. 
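
A hedged sketch of how a destructive operation could apply these
rules. Memory barriers and the rest of the hazard pointer machinery
are omitted, and all names are hypothetical:

  struct md {                 /* mapping/multiset data block */
    int shared_flag;          /* read-only marker; set when shared */
    /* entries, size, ... (refcount maintained by the gc) */
  };

  /* Stub: would make a private copy of the block. */
  static struct md *copy_md(struct md *m) { return m; }

  /* Prologue of every destructive operation. Publishing the hazard
   * pointer first lets the thread that sets shared_flag wait until
   * all current modifiers are done before the read-onlyness takes
   * effect. */
  static struct md *md_for_change(struct md *m, struct md **hazard_ptr)
  {
    *hazard_ptr = m;          /* protects m and announces modification */
    if (m->shared_flag)
      return copy_md(m);      /* shared - never change in place */
    return m;                 /* sole user - change in place */
  }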
 
It's a good question whether keeping the copy-on-write feature is 
worth this overhead. Of course, an alternative is to simply let the 
builtin mappings and/or multisets be locking, and instead have special 
objects that implement lock-free data types.
 
Another issue is if things like mapping/multiset data blocks should be 
first or second class things (c.f. issue "Memory object structure"). 
If they're second class it means copy-on-write behavior doesn't work 
across lock spaces. If they're first class it means additional 
overhead handling the lock spaces of the mapping data blocks, and if
a mapping data block is shared between lock spaces then it has to be
in some third lock space of its own, or in the global lock space,
neither of
which would be very good. So it doesn't look like there's a better way 
than to botch copy-on-write in this case. 
 
 
Issue: Emulating the interpreter lock 
 
For compatibility with old C modules, and for the _disable_threads 
function, it is necessary to retain a complete lock like the current 
interpreter lock. It has to lock the global area for writing, and
also stop all access to all lock spaces, since the thread local data 
might refer to any lock space. 
 
This lock is implemented as a read/write lock, which normally is held 
permanently for reading by all threads. Only when a thread is waiting 
to acquire the compat interpreter lock is it released as each thread 
goes into check_threads(). 
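
A minimal pthreads sketch of that scheme, assuming a hook that runs
in check_threads() (names invented):

  #include <pthread.h>

  /* Held for reading by every running thread. _disable_threads and
   * old-style C functions take it for writing. */
  static pthread_rwlock_t compat_interp_lock = PTHREAD_RWLOCK_INITIALIZER;
  static volatile int compat_lock_wanted;  /* set by would-be writers */

  /* Called from check_threads(): briefly give up the read lock so a
   * waiting writer can get in. */
  static void compat_lock_check(void)
  {
    if (compat_lock_wanted) {
      pthread_rwlock_unlock(&compat_interp_lock);
      pthread_rwlock_rdlock(&compat_interp_lock);
    }
  }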
 
This lock cannot wait for explicit lock space locks to be released. 
Thus it can override the assumption that a lock space is safe from 
tampering by holding a write lock on it. Still, it's only available 
from the C level (with the exception of _disable_threads) so the 
situation is not any different from the way the interpreter lock 
overrides Thread.Mutex today. 
 
 
Issue: Function calls 
 
A lock on an object is almost always necessary before calling a 
function in it. Therefore the central apply function (mega_apply) must 
ensure an appropriate lock is taken. Which kind of lock 
(read-safe/read-constant/write - see issue "Lock space lock 
semantics") depends on what the function wants to do. Therefore all 
object functions are extended with flags for this. 
 
The best default is probably read-safe. Flags for no locking (for the 
few special cases where the implementations actually are completely 
lock-free) and for compat-interpreter-lock-locking should probably 
exist as well. A compat-interpreter-lock flag is also necessary for 
global functions that don't have a "this" object (aka efuns). 
 
Having the required locking declared this way also relieves each
function of the burden of doing the locking to access the current
storage, and it allows future compiler optimizations to minimize lock 
operations. 
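
The flags could look something like this - a sketch with purely
hypothetical names and values:

  /* Locking required around a call, declared per function when it is
   * added. mega_apply takes the corresponding lock on the current
   * object before entering the function. */
  #define FUNC_LOCK_READ_SAFE       0  /* default: consistency only */
  #define FUNC_LOCK_READ_CONSTANT   1  /* data also stays constant */
  #define FUNC_LOCK_WRITE           2  /* exclusive access */
  #define FUNC_LOCK_NONE            3  /* implementation is lock-free */
  #define FUNC_COMPAT_INTERP_LOCK   4  /* old-style: takes the emulated
                                        * interpreter lock (also used
                                        * for efuns) */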
 
 
Issue: Exceptions 
 
"Forgotten" locks after exceptions shouldn't be a problem: Explicit 
locks are handled just like today (i.e. it's up to the pike 
programmer), and implicit locks can safely be released when an 
exception is thrown. 
 
One case requires attention: An old-style function that requires the 
compat interpreter lock might catch an error. In that case the error 
system has to ensure that lock is reacquired. 
 
 
Issue: C module interface 
 
A new add_function variant is probably added for new-style functions. 
It takes bits for the flags discussed for issue "Function calls". 
New-style functions can only assume free access to the current storage 
according to those flags; everything else must be locked (through a 
new set of macros/functions). 
 
Accessor functions for data types (e.g. add_shared_strings, 
mapping_lookup, and object_index_no_free) handle the necessary
locking internally. They will only assume that the thing is safe, i.e. 
that the caller ensures the current thread controls at least one ref. 
 
THREADS_ALLOW/THREADS_DISALLOW and their likes are not used in 
new-style functions. 
 
There will be new GC callbacks for walking module global pointers to 
things (see issue "Garbage collection and external references"). 
 
The proposed gc requires that every pointer change in a (heap 
allocated) thing is tracked (for pointers that might point to other 
heap allocated things). This is because the gc has to log the old 
state of the pointers before the first change after a gc run (see 
issue "Garbage collector", item c). For all builtin data types, this 
is handled internally in primitives like mapping_insert and 
object_set_index, so the only cases that the C module code typically 
has to handle are direct updates in the current storage. Therefore all 
pointer changes that currently look something like
 
  THIS->my_thing = some_thing; 
 
must be wrapped in some kind of macro/function call to become: 
 
  set_ptr (THIS, my_thing, some_thing); 
 
On the positive side, all the refcount twiddling to account for 
references from the C and pike stacks can be removed from the C code. 
That also includes a lot of the SET_ONERROR stuff which currently is 
necessary to avoid lost refs when errors are thrown. 
 
 
Issue: C module compatibility 
 
Currently it doesn't look like the goal to keep a source-level 
compatibility mode for C modules can be achieved. The problem is that 
every pointer assignment in every heap allocated thing must be wrapped 
inside a macro/function call to make the new gc work (see issue
"Garbage collector", item c), and lots of C module code changes such
pointers directly through plain assignments.
 
Ref issue "Emulating the interpreter lock". 
 
Ref issue "Garbage collection and external references". 
 
 
Issue: Garbage collection and external references 
 
The current gc design is that there is an initial "check" pass that 
determines external references by counting all internal references to 
each thing and subtracting that count from the thing's refcount. If 
the result isn't zero then there are external references (e.g. from 
global C variables or from the C stack) and the thing is not garbage. 
 
The new gc (c.f. issue "Garbage collector") does not refcount external 
refs and refs from the C or Pike stacks. It needs to find them some 
other way: 
 
References from global C variables are few, so they can be dealt with 
by requiring C modules and the core parts to provide callbacks that 
lets the gc walk through them (see issue "C module interface"). This 
is however not compatible with old C modules. 
 
References from C stacks are common, and it is infeasible to require 
callbacks that keep track of them. The gc instead has to scan the C 
stacks for the threads and treat any aligned machine word containing 
an apparently valid pointer to a gc candidate thing as an external 
reference. This is the common approach used by standalone gc libraries 
that don't require application support. For reference, here is one 
such garbage collector, written in C++: 
http://developer.apple.com/DOCUMENTATION/Cocoa/Conceptual/GarbageCollection/Introduction.html#//apple_ref/doc/uid/TP40002427 
Its source is here: 
http://www.opensource.apple.com/darwinsource/10.5.5/autozone-77.1/ 
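 
In outline, such a conservative scan could look like this (a sketch 
with hypothetical helpers gc_looks_like_candidate and 
gc_mark_external; the real thing must also handle stack direction and 
thread suspension): 
 
  /* Treat every aligned machine word on a suspended thread's C stack
     that looks like a pointer to a gc candidate thing as an external
     reference. Assumes the stack occupies [low, high). */
  static void gc_scan_c_stack (void **low, void **high)
  {
    void **p;
    for (p = low; p < high; p++)
      if (gc_looks_like_candidate (*p))
        gc_mark_external (*p);
  }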
 
The same approach would also be necessary to cope with old C modules 
(see issue "C module compatibility"), but since global C level 
pointers are few, it might not be mandatory to get this working. And 
besides, it appears unlikely that compatibility with old C modules can 
be kept. 
 
 
Issue: Global pike level caches 
 
Global caches that are shared between threads are common, and in 
almost all cases such caches are implemented using mappings. There's 
therefore a need for (at least) a hash table data type that handles 
concurrent access and high mutation rates very efficiently. 
 
Issue "Lock-free hash table" discusses such a solution. It's currently 
not clear whether the builtin mappings will be lock-free or not (c.f. 
the copy-on-write problem in issue "Mapping and multiset data 
blocks"), but if they're not then a mapping-like object class is 
implemented that is lock-free. It's easy to replace global cache 
mappings with such objects. 
 
 
Issue: Thread.Queue 
 
A lock-free implementation should be used. The things in the queue are 
typically disowned to allow them to become thread local in the reading 
thread. 
 
 
Issue: "Relying on the interpreter lock" 
 
FIXME 
 
 
Issue: False sharing 
 
False sharing occurs when thread local things used frequently by 
different threads are next to each other so that they share the same 
cache line. Thus the cpu caches might force frequent resynchronization 
of the cache line even though there is no apparent hotspot problem on 
the C level. 
 
This can be a problem in particular for all the block_alloc pools 
containing small structs. Using thread local pools is seldom a 
workable solution since most thread local structs might become shared 
later on. 
 
One way to avoid it is to add padding (and alignment). Cache line 
sizes are usually 64 bytes or less (at least for Intel ia32). That 
should be small enough to make this viable in many cases. 
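 
For instance, a frequently updated per-thread struct could be padded 
like this (a sketch assuming 64-byte cache lines): 
 
  /* Pad the struct to a full cache line so that adjacent pool
     entries never share a line. */
  #define CACHE_LINE_SIZE 64
  struct padded_counter {
    unsigned long count;
    char pad[CACHE_LINE_SIZE - sizeof (unsigned long)];
  };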
 
FIXME: Check cache line sizes on the other important architectures. 
 
Another way is to move things when they get shared, but that is pretty 
complicated and slow. 
 
 
Issue: Malloc and block_alloc 
 
Standard OS mallocs are usually locking. Bundling a lock-free one 
could be important. FIXME: Survey free implementations. 
 
Block_alloc is a simple homebrew memory manager used in several 
different places to allocate fixed-size blocks. The block_alloc pools 
are often shared, so they must allow efficient concurrent access. With 
a modern malloc, it is possible that the need for block_alloc is gone, 
or perhaps the malloc lib has builtin support for fixed-size pools. 
Making a lock-free implementation is nontrivial, so the homebrew ought 
to be ditched in any case. 
 
A problem with ditching block_alloc is that some code walks through 
all allocated blocks in a pool, and some code avoids garbage by 
freeing the whole pool at once. FIXME: Investigate alternatives 
here. 
 
See also issue "False sharing". 
 
 
Issue: Heap size control 
 
There should be better tools to control the heap size. It should be 
possible to set the wanted heap size so that the gc runs in good time 
before that limit is reached. Pike should detect the available amount 
memory (i.e. not counting swap) to use as default. The gc should still 
use a garbage projection strategy to keep the process below the 
configured maximum size for as long as possible. This becomes more 
important if the gc also handles garbage that was previously reclaimed 
through refcounting (c.f. issue "Garbage collector"). 
 
Malloc calls should be wrapped to allow the gc to run in blocking mode 
in case they fail. 
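 
A minimal sketch of such a wrapper (gc_run_blocking is a hypothetical 
name for a full blocking gc run): 
 
  #include <stdlib.h>

  /* On allocation failure, run the gc in blocking mode and retry
     once before giving up. */
  void *gc_checked_malloc (size_t size)
  {
    void *p = malloc (size);
    if (!p) {
      gc_run_blocking ();
      p = malloc (size);
    }
    return p;
  }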
 
 
Issue: The compiler 
 
FIXME 
 
 
Issue: Foreign thread visits 
 
FIXME. JVM threads.. 
 
 
Issue: Pike security system 
 
It is possible that keeping the pike security system intact would 
complicate the implementation, and even if it were kept intact a lot 
of testing would be required before one could be confident that it 
really works (and there are currently very few tests for it in the 
test suite). 
 
Also, the security system isn't used at all to my (mast's) knowledge, 
and it is not even compiled in by default (has to be enabled with a 
configure flag). 
 
All this leads to the conclusion that it is easiest to ignore the 
security system altogether, and if possible leave it as it is, with 
the option of getting it working later. 
 
 
Issue: Contention-free counters 
 
There is probably a need for contention-free counters in several 
different areas. It should be possible to update them from several 
threads in parallel without synchronization. Querying the current 
count is always approximate since it can be changing simultaneously in 
other threads. However, the thread's own local count is always 
accurate. 
 
They should be separated from the blocks they apply to, to avoid cache 
line invalidation of those blocks. 
 
To accomplish that, a generic tool somewhat similar to block_alloc is 
created that allocates one or more counter blocks for each thread. In 
these blocks indexes are allocated, so a counter is defined by the 
same index into all the thread local counter blocks. 
 
Each thread can then modify its own counters without locking, and it 
typically has its own counter blocks in the local cache while the 
corresponding main memory is marked invalid. To query a counter, a 
thread would need to read the blocks for all other threads. 
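 
A rough sketch of the update and query paths (all names are 
hypothetical; counter index allocation and thread registration are 
omitted): 
 
  /* Each thread owns a block of counter slots; a counter is the same
     index into every thread's block. Updates only touch the local
     block; queries sum over all threads' blocks. */
  #define MAX_COUNTERS 256
  #define MAX_THREADS 64

  struct counter_block { long slot[MAX_COUNTERS]; };

  static __thread struct counter_block *my_block; /* assumes C-level TLS */
  static struct counter_block *all_blocks[MAX_THREADS];
  static int num_threads;

  void counter_add (int counter, long n)
  {
    my_block->slot[counter] += n;       /* no locking needed */
  }

  long counter_query (int counter)      /* approximate */
  {
    long sum = 0;
    int i;
    for (i = 0; i < num_threads; i++)
      sum += all_blocks[i]->slot[counter];
    return sum;
  }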
 
This means that these counters are efficient for updates but less so 
for queries. However, since queries always are approximate, it is 
possible to cache them for some time (e.g. 1 ms). Each thread would 
need its own cache though, since the local count cannot be cached. 
 
It should be lock-free for allocating and freeing counters, and 
preferably also for starting and stopping threads (c.f. issue "Foreign 
thread visits"). In both cases the freeing steps represents a race 
problem - see issue "Hazard pointers". To free counters, the counter 
index would constitute the hazard pointer. 
 
 
Issue: Lock-free hash table 
 
A good lock-free hash table implementation is necessary. A promising 
one is http://blogs.azulsystems.com/cliff/2007/03/a_nonblocking_h.html. 
It requires a CAS (Compare And Swap) instruction to work, but that 
shouldn't be a problem. The Java implementation 
(http://sourceforge.net/projects/high-scale-lib) is in the public 
domain. In the comments there is talk about efforts to make a C 
version. 
 
It supports (through putIfAbsent) the uniqueness requirement for 
strings, i.e. if several threads try to add the same string (at 
different addresses) then all will end up with the same string pointer 
afterwards. 
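 
In outline (lf_hash_put_if_absent and string_table are hypothetical C 
counterparts to the java API): 
 
  /* Whichever thread wins the insert, every caller ends up with the
     same canonical string pointer. */
  struct pike_string *intern_string (struct pike_string *s)
  {
    struct pike_string *old = lf_hash_put_if_absent (string_table, s);
    if (old) {          /* lost the race - use the winner's copy */
      free_string (s);
      return old;
    }
    return s;           /* won the race - s is now the canonical one */
  }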
 
The java implementation relies on the gc to free up the old hash 
tables after resize. The proposed gc (issue "Garbage collector") would 
solve it for us too, but even without that the problem is still 
solvable - see issue "Hazard pointers". 
 
 
Issue: Hazard pointers 
 
A problem with most lock-free algorithms is how to know that no 
other thread is accessing a block that is about to be freed. Another 
is the ABA problem, which can occur when a block is freed and 
immediately allocated again (common with block_alloc). 
 
Hazard pointers are a good way to solve these problems without leaving 
the blocks to the garbage collector (see 
http://www.research.ibm.com/people/m/michael/ieeetpds-2004.pdf). So a 
generic hazard pointer tool might be necessary for blocks not known to 
the gc. 
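 
The core of the discipline could look like this (a simplified sketch 
with one hazard slot per thread; MFENCE as in issue "Platform 
specific primitives"): 
 
  /* A reader publishes the pointer it is about to use; a freeing
     thread may only reclaim a block once no thread has it
     published. */
  #define MAX_THREADS 64
  static void *hazard_slot[MAX_THREADS];

  void *hp_acquire (int my_thread, void *volatile *src)
  {
    void *p;
    do {
      p = *src;
      hazard_slot[my_thread] = p;       /* publish */
      MFENCE ();                        /* make the publication visible */
    } while (p != *src);                /* retry if src changed meanwhile */
    return p;
  }

  void hp_release (int my_thread)
  {
    hazard_slot[my_thread] = NULL;
  }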
 
Note however that a more difficult variant of the ABA problem still 
can occur when the block cannot be freed after leaving the data 
structure. (In the canonical example with a lock-free stack - see e.g. 
"ABA problem" in Wikipedia - consider the case when A is a thing that 
continues to live on and actually gets pushed back.) The only reliable 
way to cope with that is probably to use wrappers. 
 
 
Issue: Thread local storage 
 
Implementation would be considerably simpler if working TLS can be 
assumed on the C level, through the __thread keyword (or 
__declspec(thread) in Visual C++). A survey of the support for TLS in 
common compilers and OS'es is needed to decide whether this is a 
workable assumption (a minimal usage sketch follows the list): 
 
o  GCC: __thread is supported. Source: Wikipedia. 
   FIXME: Check from which version. 
 
o  Visual C++: __declspec(thread) is supported. Source: Wikipedia. 
   FIXME: Check from which version. 
 
o  Intel C compiler: Support exists. Source: Wikipedia. 
   FIXME: Check from which version. 
 
o  Sun C compiler: Support exists. Source: Wikipedia. 
   FIXME: Check from which version. 
 
o  Linux (i386, x86_64, sparc32, sparc64): TLS is supported and works 
   for dynamic libs. C.f. http://people.redhat.com/drepper/tls.pdf. 
   FIXME: Check from which version of glibc and kernel (if relevant). 
 
o  Windows (i386, x86_64): TLS is supported but does not always work 
   in dll's loaded using LoadLibrary (which means all dynamic modules 
   in pike). C.f. http://msdn.microsoft.com/en-us/library/2s9wt68x.aspx. 
   According to Wikipedia this is fixed in Vista and Server 2008 
   (FIXME: verify). In any case, TLS is still usable in the pike core. 
 
o  MacOS X: FIXME: Check this. 
 
o  Solaris: FIXME: Check this. 
 
o  *BSD: FIXME: Check this. 
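 
A minimal usage sketch, assuming the compiler support surveyed above: 
 
  /* Hypothetical portability shim for C-level TLS. */
  #ifdef _MSC_VER
  #  define PIKE_THREAD_LOCAL __declspec(thread)
  #else
  #  define PIKE_THREAD_LOCAL __thread
  #endif

  static PIKE_THREAD_LOCAL struct thread_state *current_thread_state;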
 
 
Issue: Platform specific primitives 
 
Some low-level primitives, such as CAS and fences, are necessary to 
build the various lock-free tools. A third-party library would be 
useful. 
 
o  An effort to make a standardized library is here: 
   http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2047.html 
   (C level interface at the end). It apparently lacks an 
   implementation, though. 
 
o  The linux kernel is reported to contain a good abstraction lib for 
   these primitives, along with implementations for a large set of 
   architectures (see 
   http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2006.08.21a.pdf). 
 
o  Another one is part of a lock-free hash implementation here: 
   http://www.sunrisetel.net/software/devtools/sunrise-data-dictionary.shtml 
   It has an MIT-style open source license (with ad clauses). 
 
It appears that the libraries themselves are very short and simple; 
the difficult part is rather to specify the semantics carefully. It's 
probably easiest to make one ourselves with ideas from e.g. the linux 
kernel paper mentioned above. 
 
Required operations: 
 
CAS(address, old_value, new_value) 
  Compare-and-set: Atomically sets *address to new_value iff its 
  current value is old_value. Needed for 32-bit variables, and on 
  64-bit systems also for 64-bit variables. 
 
ATOMIC_INC(address) 
ATOMIC_DEC(address) 
  Increments/decrements *address atomically. Can be simulated with 
  CAS (see the sketch after this list). A 32-bit version is 
  necessary; a 64-bit version would be nice. 
 
LFENCE() 
  Load fence: All memory reads in the thread before this point are 
  guaranteed to be done (i.e. be globally visible) before any 
  following it. 
 
SFENCE() 
  Store fence: All memory writes in the thread before this point are 
  guaranteed to be done before any following it. 
 
MFENCE() 
  Memory fence: Both load and store fence at the same time. (On many 
  architectures this is implied by CAS etc, but we shouldn't assume 
  that.) 
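 
As noted under ATOMIC_INC above, the increment and decrement 
operations can be simulated with CAS; a minimal sketch (assuming CAS 
returns nonzero on success, and INT32 is a 32-bit integer type): 
 
  /* Simulate ATOMIC_INC with a CAS retry loop. */
  void atomic_inc32 (volatile INT32 *address)
  {
    INT32 old;
    do {
      old = *address;
    } while (!CAS (address, old, old + 1));
  }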
 
The following operations are uncertain - still not known if they're 
useful and supported enough to be required, or if it's better to do 
without them: 
 
CASW(address, old_value_low, old_value_high, new_value_low, new_value_high) 
  A compare-and-set that works on a double pointer size area. 
  Supported on more modern x86 and x86_64 processors (c.f. 
  http://en.wikipedia.org/wiki/Compare-and-swap#Extensions). 
 
FIXME: More.. 
 
Survey of platform support: 
 
o  Windows/Visual Studio: Got "Interlocked Variable Access": 
   http://msdn.microsoft.com/en-us/library/ms684122.aspx 
 
o  FIXME: More.. 
 
 
Issue: Preemptive thread suspension 
 
The proposed gc as presented in the research paper needs to suspend 
and resume other threads. A survey of platform support for preemptive 
thread suspension: 
 
o  POSIX threads: No support. Deprecated and removed from the standard 
   since it can very easily lead to deadlocks. On some systems there 
   might still be a pthread_suspend function. 
 
o  Windows: SuspendThread and ResumeThread exist but are only 
   intended for use by debuggers. 
 
It's clear that a nonpreemptive method is required. See issue "Garbage 
collector" item g for details on that. 
 
 
Issue: OpenMP 
 
OpenMP (see www.openmp.org) is a system to parallelize code using 
pragmas that are inserted into the code blocks. It can be used to 
easily parallelize otherwise serial internal algorithms like searching 
and all sorts of loops over arrays etc. Thus it addresses a different 
problem than the high-level parallelizing architecture above, but it 
might provide significant improvements nevertheless. 
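 
For example, a serial summing loop could be parallelized with a 
single pragma (a minimal sketch): 
 
  #include <omp.h>

  /* OpenMP splits the iterations over the available threads and
     combines the per-thread sums via the reduction clause. */
  long sum_ints (const int *a, int n)
  {
    long sum = 0;
    int i;
  #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
      sum += a[i];
    return sum;
  }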
 
It's therefore worthwhile to look into how this can be deployed in 
the Pike sources. If support is widespread enough, it could even be 
made a requirement, so that its builtin tools for atomicity and 
ordering can be used (provided they are useful outside the omp 
parallelized blocks). 
 
Compiler support (taken from www.openmp.org): 
 
o  gcc since 4.3.2. 
o  Microsoft Visual Studio 2008 or later. 
o  Sun compiler (starting version unknown). 
o  Intel compiler since 10.1. 
o  ..and some more. 
 
FIXME: Survey platform-specific limitations. 
 
 
Various links 
 
Pragmatic nonblocking synchronization for real-time systems 
  http://www.usenix.org/publications/library/proceedings/usenix01/full_papers/hohmuth/hohmuth_html/index.html 
DCAS is not a silver bullet for nonblocking algorithm design 
  http://portal.acm.org/citation.cfm?id=1007945 
A simple and efficient memory model for weakly-ordered architectures 
  http://www.open-std.org/Jtc1/sc22/WG21/docs/papers/2007/n2237.pdf