Recommendations: (1) use a modern server with a reasonable amount of RAM, (2) limit the number of threads that saferewrite will use (see the THREADS mechanism in README), and (3) select the analyses that you're interested in (see the chmod mechanism in README).

The following measurements were collected on rome2, a dual EPYC 7742 (128 cores overall), overclocking disabled (so CPUs were running at 2.245GHz), 512GB RAM, 1TB swap space, running Debian 12.

One run (before elfulator was installed) analyzed all 248 cryptoint functions (not just the 248 cryptoint implementations of those functions but also 391 other implementations):

    chmod +t src/*
    chmod -t src/{int,uint}{8,16,32,64}_*
    chmod +t src/uint8_7bit_nonzero_mask_int16
    env THREADS=64 time ./analyze

There were 639 implementations in total, each of which was compiled with 12 compilers, so 7668 implementation-compiler combinations total. Timings:

    21884.21user 6898.45system 10:05.60elapsed 4752%CPU (0avgtext+0avgdata 1117584maxresident)k
    302048inputs+3216840outputs (2182847major+911883360minor)pagefaults 0swaps

The analysis cost was thus 3.8 core-seconds on average for each implementation-compiler combination.

Overall memory consumption for 64 threads varied but was never observed to pass 10GB. Most processes had RSS under 200MB. The occasional process with RSS around 1GB had total RAM usage around 3.5GB.

Disk usage for the unprivileged user carrying out measurements was 2GB at this point, not counting the space for installed system packages.

Installing elfulator (see README-elfulator) increased the user's disk usage to 16GB (mostly from 6.5GB for build-sparc and 7.1GB for build-sparc64).

Unrolling 5 implementations of int32_negative_mask for sparc32 under elfulator took 1157 seconds for the fastest implementation and 1316 seconds for the slowest. RSS for each analysis was consistently under 4GB.
For comparison, unrolling for amd64 without elfulator took 0.25 seconds for the fastest implementation and 0.40 seconds for the slowest. Unrolling has higher cost with elfulator because of the double emulation layers.

Adding elfulator lines for sparc64, arm64, arm32 showed unrolling times between 3759 seconds and 3938 seconds for sparc64; between 14261 seconds and 15656 seconds for arm64; and between 67094 seconds and 95179 seconds for arm32.

Returning to the original compilers list, 13 C compilers including sparc32 with elfulator, and analyzing all 248 cryptoint functions (so 8307 compiler-implementation combinations total) with 64 threads had the following timings:

    883025.84user 9995.15system 5:14:11elapsed 4737%CPU (0avgtext+0avgdata 7848932maxresident)k
    1664inputs+3467496outputs (2140965major+3432806724minor)pagefaults 0swaps

The 639 elfulator analyses thus added 14404 core-minutes, i.e., 22.5 core-minutes on average per implementation-compiler combination.

Overall memory consumption for 64 threads was never observed to pass 250GB. The maximum observed RSS for a single process was 8GB. The maximum observed VSZ for a single process was 137GB. For analyses of the cryptoint implementations, the maximum observed RSS for a single process was 4GB, and the maximum observed VSZ was 20GB.

Out of all 8307 compiler-implementation combinations, there were 8193 successfully unrolled combinations (including all 3224 compilations of the cryptoint implementations), including 575 of the implementations compiled for sparc32 (including all 248 of the cryptoint implementations compiled for sparc32). There were 8156 results marked equals-* (including all 3224 compilations of the cryptoint implementations), including 570 of the implementations compiled for sparc32 (including all 248 of the cryptoint implementations).

Experiments replacing python3 with pypy3 reduced user time by 2x while increasing RAM usage by about 1.5x.
Unfortunately, pypy3 occasionally hangs in __futex_abstimed_wait_common64; currently saferewrite doesn't know how to recognize the hang and restart the process.

Some src/* functions are more complicated than the cryptoint functions. Analyses of single implementations have been observed using 100GB RSS.
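
The per-combination averages quoted above can be rechecked from the two "time" outputs. What follows is a minimal Python sketch of that arithmetic. It is not part of saferewrite; the function and variable names are made up for illustration, and every number is copied from the measurements above.

```python
# Recheck the averages derived from the two `time` outputs.
# All names here are illustrative; the inputs are copied from the text.

IMPLEMENTATIONS = 639  # 248 cryptoint implementations + 391 others


def cpu_seconds(user, system):
    """Total core-seconds: user time plus system time."""
    return user + system


def elapsed_seconds(hours, minutes, seconds):
    """Wall-clock time in seconds."""
    return 3600 * hours + 60 * minutes + seconds


# Run 1: 12 compilers, before elfulator was installed.
run1_cpu = cpu_seconds(21884.21, 6898.45)
run1_combos = IMPLEMENTATIONS * 12
run1_cpu_percent = 100 * run1_cpu / elapsed_seconds(0, 10, 5.60)

# Run 2: 13 compilers, including sparc32 under elfulator.
run2_cpu = cpu_seconds(883025.84, 9995.15)
run2_combos = IMPLEMENTATIONS * 13
run2_cpu_percent = 100 * run2_cpu / elapsed_seconds(5, 14, 11)

# Average cost per implementation-compiler combination in run 1.
per_combo_seconds = run1_cpu / run1_combos

# Extra cost of the 639 elfulator analyses: run 2 minus run 1.
added_core_minutes = (run2_cpu - run1_cpu) / 60
added_per_combo_minutes = added_core_minutes / IMPLEMENTATIONS

if __name__ == "__main__":
    print(run1_combos, "combinations in run 1")
    print(run2_combos, "combinations in run 2")
    print(round(per_combo_seconds, 1), "core-seconds per combination")
    print(round(added_core_minutes), "core-minutes added by elfulator")
    print(round(added_per_combo_minutes, 1), "core-minutes per elfulator analysis")
```

The CPU percentages are included only as a consistency check: dividing core-seconds by elapsed seconds should reproduce the %CPU field that "time" printed, which appears to be truncated rather than rounded.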