Skip to content

Multithreading for KeY#3842

Open
unp1 wants to merge 15 commits into
mainfrom
bubel/mt-goals
Open

Multithreading for KeY#3842
unp1 wants to merge 15 commits into
mainfrom
bubel/mt-goals

Conversation

@unp1

@unp1 unp1 commented Jun 18, 2026

Copy link
Copy Markdown
Member

Intended Change

This pull request adds a multi-core (parallel) prover to KeY: automatic proof
search runs on a worker pool. An open goal is either not yet assigned to a worker or
it has exactly one worker, proving independent
sibling goals concurrently. There are at most as many workers as configured and the number of workers is capped at the number of processor cores. It is a port and modernisation of the 2018
HacKeYthon multithreading prototype, rebuilt on current main.

The legacy single-threaded prover stays the default and the safe fallback;
the multi-core prover is opt-in per the configuration below.

Default in this PR: to make the feature easy to try, this branch ships with
the multi-core prover enabled at 4 threads. When this PR is accepted for
main, the default will be reverted to single-core
so that the
single-core-only features (proof caching, slicing, merge rule) are active as
before. This is a one-line change to the setting default.

Architecture (conceptual)

The guiding principle is isolation, not locking. Goal.apply is split into a
concurrent compute phase (rule selection + matching, the expensive part) and a
commit phase (proof-tree mutation) serialized under a single lock. A
single-monitor GoalScheduler hands goals to workers depth-first. Non-essential
listeners are detached for the run, fresh names come from a per-worker partition
of the name allocator, shared matching/cost caches use thread-safe cache
utilities, and rules not yet concurrency-safe (the merge rule) disable themselves
during a multi-worker run. Only the standard Java profile runs in parallel.

Full conceptual write-up: Multi-Core Proving (key-docs).

Problems encountered, fixed and verified

  • Lost goals / non-closure: sibling goals shared a mutable taclet-index
    cache key and raced on it → a dropped goal and a nondeterministic non-closure.
    Fixed with an immutable per-lookup key. Reproduced in ~50% of 8-worker runs
    before, 0 after; guarded by a CI stress test.
  • Early termination: completing a goal and offering its successors is now a
    single atomic step under the scheduler monitor.
  • Shared-state hazards: a systematic audit hardened the remaining shared
    proving state; each is a separate, behaviour-preserving commit.
  • Off-EDT view updates: per-goal listeners (attached to each Goal, outside
    the proof-level suspension) fired on worker threads and touched Swing off the
    event-dispatch thread. The suspension now covers them too, and the views refresh
    once from the final proof state after a run.

Single-core-only features

Proof caching, proof slicing and the merge rule are switched off while the
multi-core prover is active (GUI greys them out with a tooltip; the status line
shows the mode) and restored when switching back. Loading a proof script and
running macros use the multi-core prover when active and the profile supports it,
with one deliberate exception: TryClose (and the Full Auto pilot that ends in
it) closes goals one at a time under a tight per-goal step budget and is pinned to
the single-core prover — a single goal offers no parallelism, and several workers
exploring its subtree apply rules less step-efficiently and can exhaust the budget
before the goal closes. Either way a run returns only once all workers stop, so
pruning sees a quiescent proof. These restrictions will be lifted incrementally
with the feature owners once the
subset is confirmed safe. Because the default is single-core, these features
work exactly as before.

Configuration

  • GUI: Settings → Prover (Single / Multi-Core) + a clickable SC / MT N×
    status-line indicator.
  • CLI: --threads N.
  • Tests: pinned single-core; -Dkey.prover.parallel[.threads] overrides.

Benchmarks

Measured on a 16-core machine, one isolated run per proof (each proof loaded in a
clean settings state). Cells show wall-clock time with the speedup vs the
single-threaded main
in parentheses; main is the single-threaded prover (the
multi-core prover does not change the single-threaded path).

Real-world proofs — automode (wall time).

Proof main
SimplifiedLinkedList.remove 29.8s 18.7s (1.59×) 10.1s (2.96×) 11.5s (2.60×)
gemplusDecimal/add 12.6s 7.8s (1.63×) 5.9s (2.14×) 6.5s (1.93×)
Saddleback_search 24.6s 18.5s (1.33×) 19.1s (1.29×) 35.4s (0.69×)
symmArray 22.1s 15.2s (1.46×) 11.9s (1.86×) 11.6s (1.90×)
coincidence/project 5.2s 3.6s (1.46×) 3.3s (1.55×) 4.6s (1.13×)
ArrayList.remove.1 3.6s 2.4s (1.51×) 2.3s (1.58×) 2.4s (1.49×)
ArrayList_concatenate 13.5s 6.9s (1.96×) 4.8s (2.82×) 5.4s (2.53×)
arith/median 5.0s 3.2s (1.57×) 2.3s (2.18×) 2.4s (2.12×)
ArrayList.remFirst 0.85s 0.74s (1.15×) 0.61s (1.39×) 0.64s (1.33×)

The 4× column is the sweet spot: wide, splitting proofs reach ~2–3×. Narrow
proofs gain little and can regress at 8× (see the shape table).
(quicksort/sort is omitted — it needs its proof script to close, so it is not a
pure-automode entry.)

Proof Shape Why this speedup
ArrayList_concatenate, SLL.remove wide / splitting many independent sibling goals → ~2.5–3× at 4 workers
symmArray, add, median moderately wide steady ~1.9–2.2× at 4 workers
Saddleback_search narrow (~3-goal frontier) limited parallel width; ~1.3× at 4×, regresses to 0.69× at 8× — extra workers do speculative work off the critical path with no width to amortise it
ArrayList.remFirst tiny (~2k nodes) too small to parallelise; ~1.1–1.4×

Diagrams. Rendered speedup bar charts are in the dev doc's Combined effect section.

Plan

  • Parallel engine, scheduler, isolation mechanisms
  • GUI / CLI integration, feature gating
  • Proof-macro support on the multi-core prover
  • Thread-safety audit of shared proving state
  • Equivalence gate + 8-worker stress tests (CI)
  • spotless / NULLNESS / unit tests green (symb_exec & proof_ref excluded, known-broken)
  • Fill in benchmark tables (diagrams generated, to attach)
  • Switch the default back to single-core before merge

Type of pull request

  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue) — see the latent-bug commits below
  • There are changes to the (Java) code
  • There are changes to the deployment/CI infrastructure (gradle, github, ...)

Ensuring quality

  • Introduced/changed code is documented (javadoc + inline comments).
  • End-user features are documented (key-docs: Multi-Core Proving).
  • Added equivalence, stress and unit tests for the new functionality.
  • Tested: the equivalence gate (single vs parallel outcome) and 8-worker stress
    tests pass; full key.core + touched-module unit tests pass.
  • Runtime performance: the single-core path is unchanged (the parallel prover is
    opt-in); see the benchmark section for multi-core speedups.

Separable, main-relevant fixes (kept as standalone commits for cherry-picking,
useful even single-core):

  • Compute NodeInfo's active statement lazily, off the proving path
  • Cache loop-invariant instantiation on the rule app, not a static field (latent
    cross-proof leak via a static field)
  • Make per-call program transformers stateless (MethodCall, JmlAssert)
  • Robustly read numeric settings stored as Integer or Long — fixes a startup
    ClassCastException (present on main) that can prevent KeY from launching when
    loading certain settings files (Cherry-Picked on Fixes for minor regressions on main #3844)
  • Don't crash reloading a recent file with no stored profile — a recent-file
    entry without a profile serialized null as the string "null", so reloading (Cherry-Picked on Fixes for minor regressions on main #3844)
    it failed to resolve a profile and showed an error instead of opening the file
  • Show a proof macro's aggregate statistics in the status line — the status line
    kept a macro's last internal-pass count instead of its aggregate result (Cherry-Picked on Fixes for minor regressions on main #3844)

Additional information and contact(s)

This PR provides a clear opt-in benefit (parallel speedup on splitting proofs)
with a safe fallback to the unchanged single-core prover.

This PR has been done with AI tooling support.

The contributions within this pull request are licensed under GPLv2 (only) for inclusion in KeY.

unp1 added 6 commits June 18, 2026 15:13
The active statement (and its position) were eagerly computed when a node's
rule application was set, on the proving path. Compute them lazily on first
access instead; nodes whose active statement is never inspected do no work.
The loop-invariant rule cached the last instantiation in a static field shared
across proofs, which could leak an instantiation between unrelated proofs.
Store it on the rule application instead.
Give the MethodCall metaconstruct a fresh per-application copy instead of
mutating shared instance state, make the class final to lock in that argument,
and make JmlAssert's condition final.
A stored settings value may deserialize as Integer or Long depending on its
magnitude and the format, but the numeric property converters assumed a fixed
boxed type and threw a ClassCastException on the other -- which could crash KeY
on startup while loading the settings. Accept any Number instead.
A recent-file entry whose profile name was never set serialized the null value as
the string "null"; reloading it then tried to resolve a profile named "null",
failed, and showed an error dialog instead of opening the file. Treat a missing
profile name (null or the legacy "null" string) as "use the default profile" on
load, and stop writing the "null" placeholder when saving.
After a compound macro finished, the status line kept the partial count from the
macro's last internal strategy pass (e.g. "Applied 2 rules, 1 goal" even though
the macro had closed thousands) instead of the macro's own aggregate. Display the
ProofMacroFinishedInfo's aggregate result when a macro completes.
ConcurrentLruCache (exact-LRU, single-lock, drop-in for synchronizedMap over an
LRUCache) for eviction-sensitive caches, StripedLruCache for pure caches, and
WeakValueInterner for identity-preserving interning under concurrency, with
concurrency tests.
@unp1 unp1 force-pushed the bubel/mt-goals branch from 1518256 to cd9f12e Compare June 18, 2026 21:32
@unp1 unp1 changed the title Multithreading: a multi-core prover for automatic proof search Multithreading for KeY Jun 18, 2026
@unp1 unp1 self-assigned this Jun 18, 2026
Add the parallelProverEnabled / parallelProverThreadCount settings and a
Profile.supportsParallelAutomode() capability: the standard Java profile opts
in, the well-definedness, information-flow and symbolic-execution profiles keep
the single-core fallback.
@unp1 unp1 force-pushed the bubel/mt-goals branch from cd9f12e to 5fc660b Compare June 18, 2026 22:29
@unp1 unp1 marked this pull request as ready for review June 18, 2026 23:15
unp1 added 6 commits June 19, 2026 06:25
Add EssentialProofListener and Proof.suspendNonEssentialListeners(), which
detaches every proof-tree and rule-app listener not tagged essential for the
duration of a run and re-attaches them afterwards, so nothing unrelated to
proving fires on a worker thread.
ParallelProver runs automatic proof search on a worker pool, one open goal per
worker, behind the AutoProvers selection seam. A single GoalScheduler monitor
hands out goals depth-first; Goal.apply is split into a concurrent compute phase
and a commit phase serialized on one lock; fresh names come from a partitioned
per-proof allocator (atomic Counter), namespace flushes are deferred, and the
merge rule is disabled during parallel runs. The taclet-index cache key is
immutable so siblings sharing the cache set cannot race on it. Includes the
scheduler unit test and the proof-equivalence gate.
Harden the shared state the workers touch during search: route the matching and
cost-path caches through the thread-safe cache utilities, intern parametric
operators and elementary updates identity-preservingly, give the per-call
instantiation caches and counters safe publication or per-worker copies, and
guard the specification repository's proving-time maps. Behaviour-preserving.
Add a settings pane and a clickable status-line indicator to switch between the
single-core and multi-core prover, a --threads CLI option, and gating of the
single-core-only features: proof caching, slicing and the merge rule are
disabled (with explanatory tooltips, updated live) while the multi-core prover
is active, and restored when switching back.
Route StrategyProofMacro and TryCloseMacro through the AutoProvers seam. The
try-close prune stays safe because start() only returns once every worker has
stopped, so no worker is live during the tree mutation; a stress test exercises
the prune-and-close path at eight workers.
Add the real-proof speedup benchmark, the synthetic best/worst-case benchmark,
a JFR profiling probe, and a high-worker non-closure stress test wired into the
CI integration-test matrix. Pin the test suite to the single-core prover by
default; opt-in parallel tests override it at runtime.
@unp1 unp1 force-pushed the bubel/mt-goals branch from 5fc660b to 7cf6734 Compare June 19, 2026 04:26
@wadoon

wadoon commented Jun 19, 2026

Copy link
Copy Markdown
Member

one thread per open goal

  1. So you are creating 1000 threads if you have 1000 open goals?
    Or do you mean, you create 1000 callable instances, ...

The idea of a worker pool is to have a fix/max amount of threads, and to dispatch work on them.

  1. Are you using/Consider to use the new virtual threads. Somebody would also call them greenlets, or stackless...

@unp1

unp1 commented Jun 19, 2026

Copy link
Copy Markdown
Member Author

one thread per open goal

  1. So you are creating 1000 threads if you have 1000 open goals?
    Or do you mean, you create 1000 callable instances, ...

no that is just the starting idea. Depending on the amounts of threads you configure to use, you have that nr of threads and it is capped at the number of processor cores of your computer. Then there is a worker pool and a lifo queue from which a free worker is assigned the next goal. I'll clarify the description thanks.

The idea of a worker pool is to have a fix/max amount of threads, and to dispatch work on them.

  1. Are you using/Consider to use the new virtual threads. Somebody would also call them greenlets, or stackless...

not yet. The first milestone was to get to a point where we have for some problems a win and KeY is (at least as far as one can test) race free. The problem is, I did not have any race conditions even under high pressure. But the problem is that other OSes with other schedulers might discover some still existing problems, so if someone has time and can just run it once in a while to get feedback, would be appreciated.

For trying out, I recommend the preview integration with the single core improvemnts. It solves some congestions here already.

@wadoon

wadoon commented Jun 19, 2026

Copy link
Copy Markdown
Member

Could you provide the command to be executed? I could try to get a 64-core machine if we would have proofs having such number of goals.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants