-rw-r--r-- 4168 saferewrite-20260201/README-resources raw
Recommendations: (1) use a modern server with a reasonable amount of
RAM, (2) limit the number of threads that saferewrite will use (see the
THREADS mechanism in README), and (3) select the analyses that you're
interested in (see the chmod mechanism in README).
The following measurements were collected on rome2, a dual EPYC 7742
(128 cores overall), overclocking disabled (so CPUs were running at
2.245GHz), 512GB RAM, 1TB swap space, running Debian 12.
One run (before elfulator was installed) analyzed all 248 cryptoint
functions (not just the 248 cryptoint implementations of those functions
but also 391 other implementations):
    chmod +t src/*
    chmod -t src/{int,uint}{8,16,32,64}_*
    chmod +t src/uint8_7bit_nonzero_mask_int16
    env THREADS=64 time ./analyze
There were 639 implementations in total, each of which was compiled with
12 compilers, so 7668 implementation-compiler combinations total.
Timings:
    21884.21user 6898.45system 10:05.60elapsed 4752%CPU (0avgtext+0avgdata 1117584maxresident)k
    302048inputs+3216840outputs (2182847major+911883360minor)pagefaults 0swaps
The analysis cost was thus 3.8 core-seconds on average for each
implementation-compiler combination.
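The 3.8 figure can be reproduced from the timing report above by adding
user and system time and dividing by the number of combinations. A
minimal Python sketch follows; the function name is illustrative and is
not part of saferewrite, and the regular expression assumes the time
output format shown above.

```python
import re

def core_seconds(time_line):
    """Sum the user and system CPU seconds found in a time output line."""
    match = re.search(r"([\d.]+)user\s+([\d.]+)system", time_line)
    if match is None:
        raise ValueError("no user/system fields found")
    return float(match.group(1)) + float(match.group(2))

# First line of the timing report, truncated to the fields that matter here.
line = "21884.21user 6898.45system 10:05.60elapsed 4752%CPU"
combinations = 639 * 12  # implementations times compilers
per_combination = core_seconds(line) / combinations
print(f"{combinations} combinations, {per_combination:.1f} core-seconds each")
# prints "7668 combinations, 3.8 core-seconds each"
```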
Overall memory consumption for 64 threads varied but was never observed
to pass 10GB. Most processes had RSS under 200MB. The occasional process
with RSS around 1GB had total RAM usage around 3.5GB.
Disk usage for the unprivileged user carrying out measurements was 2GB
at this point, not counting the space for installed system packages.
Installing elfulator (see README-elfulator) increased the user's disk
usage to 16GB (mostly from 6.5GB for build-sparc and 7.1GB for
build-sparc64).
Unrolling 5 implementations of int32_negative_mask for sparc32 under
elfulator took 1157 seconds for the fastest implementation and 1316
seconds for the slowest. RSS for each analysis was consistently under
4GB. For comparison, unrolling for amd64 without elfulator took 0.25
seconds for the fastest implementation and 0.40 seconds for the slowest.
Unrolling has higher cost with elfulator because of the double emulation
layers.
Adding elfulator lines for sparc64, arm64, arm32 showed unrolling times
between 3759 seconds and 3938 seconds for sparc64; between 14261 seconds
and 15656 seconds for arm64; and between 67094 seconds and 95179 seconds
for arm32.
Returning to the original compilers list (13 C compilers, including
sparc32 with elfulator) and analyzing all 248 cryptoint functions (so
8307 implementation-compiler combinations total) with 64 threads gave
the following timings:
    883025.84user 9995.15system 5:14:11elapsed 4737%CPU (0avgtext+0avgdata 7848932maxresident)k
    1664inputs+3467496outputs (2140965major+3432806724minor)pagefaults 0swaps
The 639 elfulator analyses thus added 14404 core-minutes, i.e., 22.5
core-minutes on average per implementation-compiler combination.
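The arithmetic behind these two figures is the difference in CPU time
between the two timing reports in this file. The sketch below assumes,
as the text implicitly does, that the 12 non-elfulator compilers cost
the same in both runs; the variable names are illustrative.

```python
# CPU seconds (user + system) copied from the two timing reports in this file.
baseline = 21884.21 + 6898.45       # 12 compilers, no elfulator
with_sparc32 = 883025.84 + 9995.15  # 13 compilers, sparc32 under elfulator

added_minutes = (with_sparc32 - baseline) / 60
per_analysis = added_minutes / 639  # one sparc32 analysis per implementation
print(f"{added_minutes:.0f} core-minutes added, {per_analysis:.1f} each")
# prints "14404 core-minutes added, 22.5 each"
```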
Overall memory consumption for 64 threads was never observed to pass
250GB. The maximum observed RSS for a single process was 8GB. The
maximum observed VSZ for a single process was 137GB. For analyses of
the cryptoint implementations, the maximum observed RSS for a single
process was 4GB, and the maximum observed VSZ was 20GB.
Out of all 8307 implementation-compiler combinations, 8193 were
successfully unrolled, including all 3224 compilations of the cryptoint
implementations. Of the 639 implementations compiled for sparc32, 575
were successfully unrolled, including all 248 cryptoint implementations.
There were 8156 results marked equals-*, again including all 3224
compilations of the cryptoint implementations; 570 of these were for
sparc32, including all 248 cryptoint implementations.
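For readers who prefer rates to raw counts, the counts in the paragraph
above can be turned into success fractions. This is a small sketch; the
labels are illustrative, and the sparc32 total of 639 is the number of
implementations stated earlier in this file.

```python
# Counts copied from the paragraph above: (label, successes, total).
counts = [
    ("unrolled, all compilers", 8193, 8307),
    ("equals-*, all compilers", 8156, 8307),
    ("unrolled, sparc32", 575, 639),
    ("equals-*, sparc32", 570, 639),
]
rates = {label: ok / total for label, ok, total in counts}
for label, rate in rates.items():
    print(f"{label}: {rate:.1%}")
```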
Experiments replacing python3 with pypy3 reduced user time by a factor
of 2 while increasing RAM usage by a factor of about 1.5. Unfortunately,
pypy3 occasionally
hangs in __futex_abstimed_wait_common64; currently saferewrite doesn't
know how to recognize the hang and restart the process.
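One generic way to work around such hangs is a wall-clock timeout that
kills and reruns the process. The sketch below is not part of
saferewrite; the function name, the timeout, and the retry count are
hypothetical. A fixed timeout cannot distinguish a hang from a
legitimately slow analysis, so it would have to be set well above the
slowest expected run.

```python
import subprocess
import sys

def run_with_restart(cmd, timeout_s, max_attempts=3):
    """Run cmd; on timeout, kill it and retry, up to max_attempts times.

    Returns the CompletedProcess of the first attempt that finishes in
    time, or None if every attempt timed out.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # subprocess.run kills the child and waits for it on timeout.
            return subprocess.run(cmd, timeout=timeout_s,
                                  capture_output=True, check=False)
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt} timed out after {timeout_s}s",
                  file=sys.stderr)
    return None

# Hypothetical usage: a trivial command standing in for one analysis.
result = run_with_restart([sys.executable, "-c", "print('ok')"], timeout_s=60)
```

This only restarts a single child process; how such a wrapper would fit
into the analyze script is left open here.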
Some src/* functions are more complicated than the cryptoint functions.
Analyses of single implementations have been observed using 100GB RSS.