
[Feature] Add Distributed Posting Router for SPANN #448

Draft
TerrenceZhangX wants to merge 95 commits into users/qiazh/pre-merge-tikv-bugfix from users/zhangt/merge-distributed-to-tikv

Conversation

@TerrenceZhangX

No description provided.

zhol01825 and others added 30 commits April 7, 2026 07:33
…c, benchmarks

Core routing (PostingRouter.h):
- Hash routing: GetOwner uses headID % NumNodes for deterministic assignment (see the sketch below)
- RemoteLock RPC for cross-node Merge serialization (try_lock + retry)
- BatchAppend, HeadSync, InsertBatch packet types and handlers
- TCP-based server/client for inter-node communication
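
A minimal sketch of the modulo owner lookup described in the Hash routing bullet; the class and member names here are placeholders, not the real PostingRouter API.

```cpp
#include <cstdint>

// Illustrative only: every headID maps to exactly one owner node, so all
// nodes agree on ownership without coordination.
class PostingRouterSketch {
public:
    PostingRouterSketch(int numNodes, int localNodeIndex)
        : m_numNodes(numNodes), m_localNodeIndex(localNodeIndex) {}

    // Deterministic assignment: headID % NumNodes.
    int GetOwner(std::uint64_t headID) const {
        return static_cast<int>(headID % static_cast<std::uint64_t>(m_numNodes));
    }

    // A posting is handled locally only if this node owns its head.
    bool IsLocal(std::uint64_t headID) const { return GetOwner(headID) == m_localNodeIndex; }

private:
    int m_numNodes;
    int m_localNodeIndex;
};
```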

ExtraDynamicSearcher.h integration:
- EnableRouter/AdoptRouter for index lifecycle management
- Split: BroadcastHeadSync after creating/deleting heads
- MergePostings: Cross-node lock for neighbor headID on different node
- MergePostings: BroadcastHeadSync for deleted head after merge
- Reassign: Route Append to owner node + FlushRemoteAppends
- AddIndex: Route appends to owner node via QueueRemoteAppend
- SetHeadSyncCallback: Wire up HeadSync + RemoteLock callbacks

Infrastructure:
- IExtraSearcher/Index/VectorIndex: Add routing virtual method chain
- Options/ParameterDefinitionList: RouterEnabled, RouterLocalNodeIndex,
  RouterNodeAddrs, RouterNodeStores config params
- CMakeLists: Link Socket sources and Boost into SPTAGLibStatic
- Connection.cpp: Safe remote_endpoint() with error_code (no throw)
- Packet.h: Append, BatchAppend, InsertBatch, HeadSync, RemoteLock types
- SPFreshTest.cpp: ApplyRouterParams, FlushRemoteAppends, WorkerNode test
- Benchmark configs: 100k/1m/10m x 1/2/3 node
- run_scale_benchmarks.sh: Automated benchmark runner
- docker/tikv: TiKV cluster docker-compose + config

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix index dir creation: create parent dir only (not spann_index subdir)
  so the build code creates spann_index and the safety check passes
- Clear checkpoint after build phase so driver re-runs all insert batches
- Add VectorIndex::AdoptRouter to transfer router between batch clones
  instead of creating new TCP server per batch (port conflict fix)
- Fix ExtraDynamicSearcher::AdoptRouter to override IExtraSearcher interface

100k results (routing works, ~50/50 local/remote split):
  1-node steady-state: 90.2 vps
  2-node steady-state: 80.5 vps
  3-node steady-state: 77.7 vps
No scaling at 100k due to small batch size (100 vectors) and shared TiKV.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
12 slides covering: problem statement, solution architecture, full flow
comparison (single vs distributed), hash routing, append write path,
split/merge/reassign routing, HeadSync broadcast, 3-node sequence
diagram, design decisions, network protocol, config, and 100k results.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix ini thread params: replace NumThreads with NumSearchThreads/NumInsertThreads
- Add 100M benchmark ini files (1-node, 2-node, 3-node)
- Update data paths to /mnt/data_disk/sift1b in all ini files
- Add BENCHMARK_GUIDE_SCALE_COMPUTE.md (English)
- Add BENCHMARK_RESULTS_SCALE_COMPUTE.md with 100K/1M/10M results
- Update docker-compose and tikv.toml for 3-PD/3-TiKV cluster
- Update run_scale_benchmarks.sh with multi-scale orchestration
- Add .gitignore entries for generated benchmark artifacts
- Fix FullSearch routing for multi-node search (per-node build)
- Update 10M benchmark: insert throughput 2-node 1.65x, 3-node 1.98x
- Search latency 10M: 2-node -35%, 3-node -50% vs 1-node
- Near-linear insert scaling across all data sizes (100K, 1M, 10M)
- Update benchmark configs, test harness, and scale benchmark script
- Delete all 24 benchmark INI files from Test/
- Replace section 5 in BENCHMARK_GUIDE_SCALE_COMPUTE.md with complete
  deterministic generation rules (scale table, topology rules, template,
  Python generator script)
- INI files can be regenerated on demand via the guide
- Embed full docker-compose.yml and tikv.toml contents in
  BENCHMARK_GUIDE_SCALE_COMPUTE.md section 4.1
- Remove docker/tikv/ from git tracking (files stay on disk)
- Use <NVME_DIR> placeholder instead of absolute paths

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bug fixes:
- SPANNIndex.cpp: Remove redundant SortResult() in FullSearchCallback that
  corrupted remote search results (heapsort on already-sorted data)
- TestDataGenerator.cpp: Fix EvaluateRecall truth NN stride from 1 to K

Feature:
- SPFreshTest.cpp: Add BuildOnly parameter to skip insert batches

Benchmark results (Float32/dim64, 10M scale):
- 1-node: 93.8 vps, 2-node: 200.0 vps (2.13x), 3-node: 271.4 vps (2.89x)
- Recall stable within each config after double-sort fix

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Measure RPC round-trip time (sendTime → future.get()) and assign
per-query latency for remote search results. Previously p50/min
latency showed 0 for remote queries.
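
A rough sketch of that measurement, assuming a std::future-style pending reply; the struct, payload type, and function name are placeholders.

```cpp
#include <chrono>
#include <future>

// Illustrative only: capture the timestamp when the RPC is sent and take the
// full round-trip (sendTime -> future.get()) as the remote query's latency,
// so p50/min no longer report 0 for remote results.
struct TimedRpcSketch {
    std::chrono::steady_clock::time_point sendTime;  // recorded at send
    std::future<int> reply;                           // placeholder payload type
};

double WaitAndMeasureUs(TimedRpcSketch& rpc) {
    rpc.reply.get();  // blocks until the remote search response arrives
    auto now = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(now - rpc.sendTime).count();
}
```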

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Revert TiKVVersionMap node-scoped prefix optimization (vc:{nodeIndex}:...)
that broke cross-node version check correctness. Version map now uses
shared namespace (vc:{layer}:...) so all nodes can read/write the same
version data. Node-scoped optimization deferred to future branch.

Also add explanatory comments for AdoptRouter and HeadSync broadcast.
Restore commented-out debug log lines.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
These were only used by the reverted node-scoped version map.
LocalToGlobalVID is kept (used in insert path for VID uniqueness).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
InsertVectors now uses dual path: per-vector multi-threaded insert
for single-node (original behavior), bulk AddIndex for router-enabled
multi-node (amortizes RPC overhead via batched remote appends).
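
A hypothetical sketch of that dual path; the hooks stand in for the real per-vector insert, bulk AddIndex, and remote-append flush, none of which are shown here.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Placeholder callables for the two insert paths.
struct InsertHooks {
    std::function<void(std::size_t)> insertOne;   // single-node per-vector insert
    std::function<void()> bulkAddIndex;           // router path: one batched call
    std::function<void()> flushRemoteAppends;     // send queued cross-node appends
};

void InsertVectorsSketch(const InsertHooks& hooks, std::size_t count,
                         bool routerEnabled, unsigned numThreads) {
    if (!routerEnabled) {
        // Original behavior: fan the batch out across insert threads.
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < numThreads; ++t) {
            workers.emplace_back([&, t] {
                for (std::size_t i = t; i < count; i += numThreads) hooks.insertOne(i);
            });
        }
        for (auto& w : workers) w.join();
    } else {
        // Router-enabled multi-node: one bulk call amortizes per-RPC overhead.
        hooks.bulkAddIndex();
        hooks.flushRemoteAppends();
    }
}
```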

Also add comment on AddIndex explaining caller-side shard partitioning
and LocalToGlobalVID purpose.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restore the original vidIdx counter loop for post-heap version
filtering instead of the candidateIndices array approach. Both are
functionally equivalent but the original pattern is simpler.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Restore 'For quantized index, pass GetFeatureDim()' comment
- Remove else { func(); } branch that was not in the original code

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use ConvertToString(valueType) and INI parameters to build the
perftest_* filenames, matching TestDataGenerator::RunLargeBatches
convention. Moves filename construction out of the per-batch loop.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Document the 3-file INI pattern for multi-node benchmarks:
- _build.ini: Rebuild=true, no Router (build phase)
- _driver.ini: Rebuild=false, Router enabled (driver/n0)
- _n{i}.ini: worker nodes (n1, n2, ...)

Update Python generator and shell script to match. Remove the
sed-based approach of patching _n0.ini at runtime.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move Router parameters (RouterEnabled, RouterLocalNodeIndex,
RouterNodeAddrs, RouterNodeStores) from DefineSSDParameter to
DefineRouterParameter with its own [Router] section in:
- ParameterDefinitionList.h: new DefineRouterParameter macro
- Options.h: SetParameter/GetParameter handle 'Router' section
- SPANNIndex.cpp: SaveConfig outputs [Router] block
- SPFreshTest.cpp: read [Router] INI section, ApplyRouterParams
  uses 'Router' section
- Benchmark guide: updated INI template and Python generator
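
For illustration only, a guess at the shape of the resulting [Router] block using the four parameter names above; the values and their exact syntax are assumptions, not a tested configuration.

```ini
; Illustrative shape only - values (and the exact syntax expected by
; RouterNodeAddrs / RouterNodeStores) are placeholders.
[Router]
RouterEnabled=true
RouterLocalNodeIndex=0
RouterNodeAddrs=<node0 addr>,<node1 addr>,<node2 addr>
RouterNodeStores=<per-node store setting>
```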

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Distributed scale results belong in BENCHMARK_RESULTS_SCALE_COMPUTE.md,
not the single-node 10M results file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
DefineRouterParameter was nested inside #ifdef DefineSSDParameter, causing
router parameters to be silently ignored. Moved the #endif to the correct
position so DefineRouterParameter is at top-level scope. Also removed a
debug log line from the DefineRouterParameter macro in Options.h.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implement benchmark-level query distribution where each node independently
searches its contiguous partition of the query set, coordinated by barrier
files (same mechanism as insert distribution). This replaces the previous
RPC-based approach, eliminating RPC overhead and serial head search.

Key changes in SPFreshTest.cpp:
- BenchmarkQueryPerformance: partition queries across nodes, use barrier
  files for synchronization, compute QPS = totalQueries / max(wallTime) (see the sketch after this list)
- WorkerNode: unified command loop handling both search and insert commands
  via shared index directory
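
A small sketch of the slicing and aggregation described in the BenchmarkQueryPerformance bullet; the function names are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative: node n searches the contiguous query slice [n*Q/N, (n+1)*Q/N),
// and aggregate QPS is totalQueries divided by the slowest node's wall time.
struct QuerySlice { std::size_t begin, end; };

QuerySlice SliceForNode(std::size_t totalQueries, std::size_t numNodes, std::size_t nodeIndex) {
    return { nodeIndex * totalQueries / numNodes, (nodeIndex + 1) * totalQueries / numNodes };
}

double AggregateQps(std::size_t totalQueries, const std::vector<double>& perNodeWallSeconds) {
    double slowest = *std::max_element(perNodeWallSeconds.begin(), perNodeWallSeconds.end());
    return static_cast<double>(totalQueries) / slowest;  // QPS = totalQueries / max(wallTime)
}
```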

Results (10M Float32, 200 queries, TopK=5):
- 1-node: 194 QPS baseline
- 2-node: 404 QPS (2.08x speedup, super-linear due to cache effects)
- 3-node: 488 QPS (2.52x speedup)
- Insert scaling: 1→2→3 node = 119→211→314 vec/s (1.77x, 2.64x)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Fix remote lock leak in MergePostings (ExtraDynamicSearcher.h)
   - RAII RemoteLockGuard ensures remote lock is released on all exit
     paths (continue/return/exception), preventing distributed deadlock (see the sketch after this list)

2. Fix buffer overflow in BatchRouteSearch (SPANNIndex.cpp)
   - Validate response array sizes before accessing result vectors
   - Fall back to local search on size mismatch

3. Fix missing send-failure callback in SendRemoteLock (PostingRouter.h)
   - Add failure callback to complete the promise on send error,
     matching the pattern used by all other SendPacket call sites
   - Prevents 5-second stall on every send failure

4. Normalize atomic operation in SendRemoteLock (PostingRouter.h)
   - Change m_nextResourceId++ to fetch_add(1) for consistency

5. Fix uninitialized workerTime in barrier coordination (SPFreshTest.cpp)
   - Initialize workerTime and validate ifstream read
   - Skip worker timing on parse failure instead of using garbage value
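
A minimal sketch of the RAII pattern named in item 1, with the acquire/release RPCs abstracted as callables; this is not the actual RemoteLockGuard implementation.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Illustrative RAII wrapper: acquire the cross-node lock on construction and
// release it on every exit path (continue/return/exception) in the destructor.
class RemoteLockGuardSketch {
public:
    RemoteLockGuardSketch(std::uint64_t headID,
                          const std::function<bool(std::uint64_t)>& acquire,
                          std::function<void(std::uint64_t)> release)
        : m_headID(headID), m_release(std::move(release)), m_held(acquire(headID)) {}

    ~RemoteLockGuardSketch() { if (m_held) m_release(m_headID); }

    RemoteLockGuardSketch(const RemoteLockGuardSketch&) = delete;
    RemoteLockGuardSketch& operator=(const RemoteLockGuardSketch&) = delete;

    bool Held() const { return m_held; }  // caller can defer/retry if not held

private:
    std::uint64_t m_headID;
    std::function<void(std::uint64_t)> m_release;
    bool m_held;
};
```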

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove BatchRouteSearch, SetFullSearchCallback, GetSearchNodeCount, and all
supporting RPC infrastructure (SearchPostingRequest/Response,
FullSearchBatchRequest/Response structs, search callbacks, handler methods,
and related member variables) from PostingRouter, SPANNIndex, Index,
VectorIndex, ExtraDynamicSearcher, SPFresh, and Packet.

Tests use barrier-based distributed search exclusively; the RPC-based search
routing is dead code. Existing SearchRequest/SearchResponse packet types are
preserved as they are used by the pre-existing Aggregator/Client/Server code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Critical fixes:
- Fix integer overflow in DecodeVectorSearchResponse (ExtraTiKVController.h)
  Prevents OOB reads from corrupt TiKV responses with large numResults.
- Fix m_mergeJobsInFlight counter underflow in MergePostings retry paths
  (ExtraDynamicSearcher.h) Add increment before re-enqueued MergeAsyncJob
  to match the unconditional decrement in exec().

High fixes:
- Add FlushRemoteAppends after Split reassignment (ExtraDynamicSearcher.h)
  Ensures queued remote appends are sent after CollectReAssign in Split().
- Fix data race on m_nodeAddrs in ConnectToPeer (PostingRouter.h)
  Snapshot address under m_connMutex before retry loop.
- Fix BroadcastHeadSync reading m_nodeAddrs without lock (PostingRouter.h)
  Snapshot node count under m_connMutex before iterating.

Medium fixes:
- Fix m_storeToNodes race in AddNode - move inside m_connMutex scope.
- Fix unvalidated entryCount in HandleHeadSyncRequest with buffer-end
  tracking to prevent overruns from corrupt packets.
- Add buffer-end tracking in BatchRemoteAppendRequest::Read to catch
  overruns during per-item deserialization.
- Make m_asyncStatus atomic to fix race between async jobs and Checkpoint.
  Use exchange() for atomic read-and-reset in Checkpoint.
- Make shared ErrorCode ret atomic in LoadIndex and WriteDownAllPostingToDB
  parallel loops.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Ubuntu and others added 19 commits April 23, 2026 06:14
When SKIP_HEAD_BUILD=1 is set, the script checks whether HeadIndex exists
in the index directory. If found, it uses RebuildSSDOnly=true to skip the
SelectHead+BuildHead phases (only SSD postings are rebuilt). If HeadIndex
is missing, it falls back to a full build automatically.

This saves significant time on re-runs where only SSD postings need
to be rebuilt.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When NOCACHE=1, builds at large scales (e.g. 100M) are impractical because
each split read goes through cold TiKV. With BUILD_WITH_CACHE=1, the build
phase keeps TiKV block cache, OS page cache, and VersionCache enabled. Then
before the search/insert phase the script swaps TiKV to tikv_nocache.toml
(restarting only the TiKV containers, preserving data and PD state) and
drops OS caches, so the measured search/insert phase runs fully cold.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ig override

- CollectReAssign: use MultiScanPostings instead of MultiGet in multi-chunk
  mode so nearby-heads scanning reads actual posting data (not just count key)
- SPFreshTest: read AppendThreadNum from [Benchmark] section and pass as
  ssdOverride so it overrides the hardcoded default (was using insertthread)
ExtraDynamicSearcher.h:
- Split cluster ks ordering, head vector preservation
- Multi-chunk MultiScanPostings in CollectReAssign
- Error code propagation through PutPostingToDB / AddHeadIndex paths

ExtraTiKVController.h:
- Retry + region_error handling for Delete / Set/GetPostingCount /
  PutChunk / PutBaseChunk / ScanPosting / DeleteRange paths
- Distinguish VectorNotFound from gRPC error
Add per-skip Warning log when a posting buffer contains a VID outside
[0, m_versionMap->Count()). Helps diagnose data corruption / sentinel
VIDs leaking into Reassign scan.
In CollectReAssign's nearby-heads scan (m_reassignK > 0), postings were
read into p_exWorkSpace->m_pageBuffers and iterated via raw pointers
into those buffers. tryBatchReassign -> RNGSelection -> SearchHeadIndex
-> SearchDiskIndex -> searcher->SearchIndex(p_exWorkSpace, ...) reuses
the SAME m_pageBuffers for its own MultiGet/MultiScanPostings,
overwriting (or reallocating) the buffer the outer loop is mid-scan
over. This produced "skip invalid VID" warnings whose tail bytes
matched leftover/missing-key returns (0xFF/0xFE patterns).

Fix: snapshot each nearby posting into a local std::string before
iterating, mirroring the safe pattern already used by the postingLists
loop above.
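
A sketch of that snapshot pattern with placeholder names; the point is only that the copy is taken before any nested search can reuse or reallocate the shared workspace buffers.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Illustrative only: snapshot each nearby posting into a local std::string
// before processing, because the nested head/disk search reuses the same
// workspace page buffers mid-scan.
void ScanNearbyPostingsSketch(std::vector<std::string>& workspacePageBuffers,
                              const std::function<void(const std::string&)>& tryReassign) {
    for (std::size_t i = 0; i < workspacePageBuffers.size(); ++i) {
        std::string snapshot = workspacePageBuffers[i];  // copy before any nested search
        tryReassign(snapshot);  // may overwrite workspacePageBuffers internally;
                                // the snapshot stays valid either way
    }
}
```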
…ix; eval 2026-04-23 batch 1-3 results

DIAG instrumentation (IExtraSearcher.h IndexStats):
- 5 atomic log2 histograms (22 buckets, 1us-1s+ / 1B-1MB+):
  AppendLockWait, AppendGetUs, AppendPutUs, AppendPostingBytes, SplitLockWait
- HistBucketOf / HistAdd / FormatHist helpers
- PrintStat unconditionally emits 5 [DIAG] lines

ExtraDynamicSearcher.h:
- Append single-chunk RMW path: time lock-wait, Get, Put; record bytes
- Split: time write-lock acquisition
- AllFinished ALL DONE branch: per-layer [DIAG] dump
- MergeAsyncJob re-queue: increment m_mergeJobsInFlight + m_totalMergeSubmitted
  (was missing, caused in-flight underflow + completed > submitted)

Eval (evaluation/2026-04-23):
- benchmark_spfresh_sift1b_v10_multichunk.ini (UseMultiChunkPosting=false, 16 insert threads)
- benchmark_multichunk_20260424_023309.log (live run incl. batch 1-3 DIAG output)
- output.json (per-batch search/insert metrics)
- ANALYSIS_BATCH1-3.md: client vs TiKV-server latency; concludes server is healthy
  (raw_get/put avg <1ms, 0 stalls, 78 regions); client-side overhead is the
  bottleneck (10-37x gap, growing per batch)
…s by suffix

Upstream agent (eng-spfresh-worker) committed 4 sdr-host-specific config
tweaks (paths to /mnt/nvme, TiKV memory to 40GB) plus refresh-reference.sh
and run_perf_validation.sh helpers.

Auto-merge took upstream's path/memory changes since our side had not
modified these templates. To keep both environments working without
fighting on every push:

- *_sdr.ini / tikv_sdr.toml         upstream's sdr-host config (saved)
- *_template.ini / tikv.toml        restored to .9/.7 paths + 200GB cache

New files added:
- configs/benchmark_insert_dominant_template.ini  10× expansion workload
- /cluster_2node.conf                              .9 driver + .7 worker

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rge-distributed-to-tikv

Resolve conflicts preserving features from both sides:
- HEAD: distributed worker/dispatcher/router, AsyncAppend, MergeAsyncJob(3-arg),
  m_asyncStatus, headVec layout with metaDataSize prefix, NumSearchDuringInsertThreads
- qiazh: VersionReadPolicy, TiKV PutBaseChunkAndCount/AsyncGetPostingCounts batch
  RPCs, AddIndexAsync* (multi-chunk + single-key), Append RMW data-loss fix
  (distinguish Key_NotFound from errors), GetCachedPostingCount -1 sentinel,
  extensive DIAG stats and histograms (reassignJobLatency, queueDepth,
  splitNewHeadCount, MC histograms), batch barrier wait timing.

Build verification: Release/SPTAGTest builds cleanly with -DTIKV=ON -DTBB=ON.
Added absl_synchronization to SPTAGTest link deps (qiazh's main.cpp uses
absl::SetMutexDeadlockDetectionMode).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Restore evaluation/2026-04-23/* to qiazh's pre-merge-tikv-bugfix versions.
- Drop zhangt-only evaluation/2026-04-20-merge/ and evaluation/backend_comparison/.
- Trim evaluation/distributed/ to only the Insert dominant benchmark essentials
  (README.md, run_distributed.sh, configs/benchmark_insert_dominant_template.ini).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@TerrenceZhangX force-pushed the users/zhangt/merge-distributed-to-tikv branch from dfc4c89 to a158014 on May 7, 2026 13:55
TerrenceZhangX and others added 10 commits May 7, 2026 14:08
Template (benchmark_insert_dominant_template.ini):
- Align knobs with qiazh's benchmark_spfresh_sift1b_v10_multichunk.ini:
  AppendThreadNum 48->144 (I/O-bound async append, ~3x cores for higher
  in-flight TiKV RPC concurrency), add PostingPageLimit=8,
  PostingCountCacheCapacity=1000000, DistributedVersionMap=true.
- Drop deprecated VersionCacheTTLMs (qiazh removed in this branch).
- Default data paths to /mnt/nvme/sift1b/{bigann_base.u8bin,query.10K.u8bin}
  (matches our mounted RAID); TruthPath=truth (auto-generated by SPTAGTest).
- Add TiKVPDAddresses=PLACEHOLDER for multi-machine fill-in by run_distributed.sh.

README:
- Reframe around the multi-machine workflow (single-machine multi-process
  scripts are no longer included).
- Adopt qiazh's TiKV operational structure: prerequisites, per-node start,
  health check, optional pre-split & scatter (adapted to per-node independent
  PDs since each machine runs its own TiKV instead of one shared 3-PD cluster),
  key-knob table, and per-batch output JSON structure.
- Preserve our distributed-specific sections: architecture diagram, TiKV
  deployment model (independent per-node, no cross-node Raft), dispatch
  protocol, run_distributed.sh subcommands and env overrides, troubleshooting.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
In distributed mode, each worker process needs to connect to its own local
PD/TiKV instance. The driver writes the per-process TiKVPDAddresses into
ssdOverrides (indexed by WORKER_INDEX), but RunWorker was calling
VectorIndex::LoadIndex(path, index) without applying those overrides — so
the worker inherited the driver's PD address from the saved indexloader.ini
and silently routed every KV write back to the driver's TiKV regardless of
the consistent-hash routing decision.

Fix: add a LoadIndex overload that takes a parameter overrides map and
applies it between LoadIndexConfig and LoadIndexData, ensuring m_options
reflects worker-local settings before PrepareDB constructs the TiKVIO
connection. RunWorker now calls this overload with ssdOverrides.
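
A hypothetical sketch of the overload's shape, assuming options live in a string map; the real SPTAG types and call signatures differ.

```cpp
#include <map>
#include <string>

// Hypothetical shape only: apply per-worker overrides (e.g. TiKVPDAddresses)
// after the saved config is read but before the data/DB layer connects.
struct IndexLoaderSketch {
    std::map<std::string, std::string> m_options;

    void LoadIndexConfig(const std::string& /*path*/) { /* read saved indexloader.ini */ }
    void LoadIndexData() { /* PrepareDB uses m_options, including the PD address */ }

    void LoadIndex(const std::string& path,
                   const std::map<std::string, std::string>& overrides) {
        LoadIndexConfig(path);                       // driver's saved settings
        for (const auto& kv : overrides)
            m_options[kv.first] = kv.second;         // worker-local settings win
        LoadIndexData();                             // connects to the worker's own TiKV
    }
};
```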

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…hput fixes

Brings the `users/zhangt/merge-distributed-to-tikv` branch to a working
2-node sharded SPTAG-on-TiKV insert-dominant pipeline. Fixes 26 bugs
catalogued in `docs/distributed-bugs.md`; this is the consolidated landing
commit for review.

High-level architecture
-----------------------
- Dispatcher / Driver / Worker tri-process layout (1 dispatcher + N workers).
  Driver is also worker 0; remote workers are 1..N-1.
- Routing: consistent-hash ring keyed by headID. Each posting has exactly
  one owner node. Head index is replicated to all nodes via HeadSync (see the
  ring lookup sketch after this list).
- Build: split workers route postings to owner via PutPosting; receivers
  write to local TiKV.
- Insert: AppendCallback splits/merges locally; cross-node forward via
  BatchRemoteAppend; receiver runs the same callback.
- Search: scatter-gather. SearchIndex groups candidates by owner,
  RemoteFetchPostings pulls non-local postings, results merged at driver.
- Per-node TiKV cluster + per-node `TiKVVersionMap` (no shared state).
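
A generic consistent-hash ring lookup, sketched only to show the routing primitive named above; the hash function and container choices here are illustrative, not the code in this PR.

```cpp
#include <cstdint>
#include <map>

// Illustrative ring: headID is hashed onto a ring of virtual-node points and
// owned by the first vnode at or after that point, wrapping to the start.
class HashRingSketch {
public:
    void AddVnode(std::uint32_t ringPoint, int nodeIndex) { m_ring[ringPoint] = nodeIndex; }

    int GetOwner(std::uint64_t headID) const {
        std::uint32_t point = HashKey(headID);
        auto it = m_ring.lower_bound(point);          // first vnode >= point
        if (it == m_ring.end()) it = m_ring.begin();  // wrap around the ring
        return it->second;
    }

private:
    static std::uint32_t HashKey(std::uint64_t headID) {
        return static_cast<std::uint32_t>(headID * 2654435761u);  // placeholder mix
    }
    std::map<std::uint32_t, int> m_ring;  // ring point -> owning node index
};
```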

Key fixes in this drop (see docs for full per-bug detail)
---------------------------------------------------------
Build / resume:
  Bug 1   InitWorkSpace dispatch restored (AddIndex segfault)
  Bug 14  Resume mode: reset m_topIndex per layer
  Bug 15  Resume mode: HeadIDFile path uses prior layer suffix
  Bug 16  Resume mode: don't double-translate p_headToLocal

Distributed insert correctness:
  Bug 2   Insert path actually dispatches to workers when router enabled
  Bug 5   BindWorkerToAllLayers (layer 1+ now has m_worker)
  Bug 11  GetOwner gate on Split / MergePostings / Reassign read paths
  Bug 17  True sharded build (PutPosting) + sharded search (FetchPostings)
  Bug 18  Reassign no longer inlines FlushRemoteAppends (cross-node deadlock)
  Bug 19  AppendCallback unconditionally injects head record (HeadSync race)
  Bug 21  MergePostings owner check moved outside rwLock-bucket equality

TiKV / version map:
  Bug 3   LoadIndex multi-chunk dimensioning
  Bug 4   AddIDCapacity multi-node growth = numNodes * delta
  Bug 13  AddBatchInitialDeleted bulk path (avoids O(N) RPC storm)
  Bug 22  Distributed version map convergence: SetVersionBatch on
          AppendCallback receiver and SearchIndex post-fetch loop

Networking / RPC robustness:
  Bug 6   SafeSimpleReadBuffer (no more out-of-bounds reads on bad packets)
  Bug 7   m_callbackLifetimeMutex protects callback invocation vs destruction
  Bug 8   SendBatchRemoteAppend chunks to avoid 60s RPC timeout
  Bug 10  FlushRemoteAppends serialised + bumped server thread / connection
          pools (no thundering-herd 143-RPC parse-failed storm)
  Bug 12  Search scatter-gather (was: single-node candidates only) [via 17]
  Bug 20  ClearCallbacksIfOwner with owner-token (multi-layer co-existence)

Throughput / livelock under sustained load:
  Bug 23  HandleBatchAppendRequest fan-out cap 16 -> 4 (TiKV chunk-mutex)
  Bug 24  MergePostings retry bound + remote-bucket explicit defer
  Bug 25  SetVersionBatch numNodes-aware bound + AddBatch fills 0xff
  Bug 26  RemotePostingOps background executor: batch handlers no longer
          occupy network-pool threads, freeing them for FetchPostings /
          HeadSync (search recall protected from insert head-of-line).

Files
-----
Core changes are in:
  AnnService/inc/Core/SPANN/Distributed/   (new sharded RPC plane)
  AnnService/inc/Core/SPANN/ExtraDynamicSearcher.h
  AnnService/inc/Core/Common/{IVersionMap,TiKVVersionMap}.h
  AnnService/inc/Socket/ (defensive parsing, larger pool)
  Test/src/SPFreshTest.cpp (BindWorkerToAllLayers, dispatcher timeout)

Configs:
  evaluation/distributed/configs/cluster_2node.conf       (new, was at root)
  evaluation/distributed/configs/benchmark_insert_dominant_2node.ini (new)
  evaluation/distributed/configs/tikv.toml               (new)

Docs:
  docs/distributed-bugs.md  (full per-bug write-up + validation table)

Validation status
-----------------
  v17: Bug 23/25 confirmed (recall 0.9520 batch 0; 0 OOR rejections;
       0 PD reconnects). Recall 0.228 in batch 1 traced to FetchPostings
       60s timeouts caused by network-thread starvation -> Bug 26 fix.
  v18: pending — verify Bug 26 brings recall back to 0.94.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pulls in 6 upstream commits:
  - 93764e0 Optimize TiKV insert RPC and version map batching
  - ac465eb Add AddIndex lock timing diagnostics
  - fe0fe89 Optimize TiKV async posting RPCs
  - 9408d96 Add layer0-only version map search option
  - 98d7421 Refine TiKV version map search checks
  - 87bb2f1 Refine TiKV version map cache handling

Conflicts resolved while preserving Bug 11/19/23-26 distributed
invariants in ExtraDynamicSearcher.h.

IVersionMap.h: kept BOTH our AddBatchInitialDeleted (used by
distributed AddIDCapacity with capa*numNodes growth) AND upstream's
AddBatch(num, deleted) overload.

TiKVVersionMap.h: took upstream's CacheFresh/CacheTTL (87bb2f1) +
striped refresh mutexes; kept our AddBatchInitialDeleted bulk-tombstone
override; auto-merged AddBatch(num, deleted) override (93764e0).

ExtraDynamicSearcher.h: kept ours for AddIDCapacity (distributed
numNodes-aware growth + AddBatchInitialDeleted); auto-merged everything
else (kMaxMergeRetries=64, m_remoteBucketLocked, m_worker->GetOwner
gates, SetVersionBatch in AppendCallback / SearchIndex remote-fetch loop,
upstream's bool p_checkVersionMap parameter, ShouldCheckVersionMapInSearch,
fe0fe89's MultiGetWithStatus/MultiPutWithStatus optimization).

Build verified: spfresh + SPTAGTest both build clean against system
protobuf 3.12.4 / gRPC.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two distributed-mode improvements bundled together:

1) SPFRESH_SHARD_STRIDE env-var gated stride / odd-even sharding in
   SPFreshTest.cpp (RunBenchmark + RunWorker). When the env var is set
   to '1'/'true', each node loads the FULL per-iteration batch and
   keeps only rows where (rowIdx % numNodes) == nodeIndex, instead of
   the default contiguous slice [n*B/N, (n+1)*B/N).
   Motivation: in v18 we observed driver=71 vs worker=2 layer-0
   splits despite a uniform consistent-hash ring. The contiguous
   slicing of the dataset (which on SIFT-like data has spatial
   locality) likely lands similar vectors on the same node, packing
   a few RNG-selected layer-0 heads to overflow on driver and not on
   worker. Stride sharding interleaves the data and lets us A/B test
   whether the layer-0 skew (and its layer-1 cascade asymmetry)
   originate from data ordering.

   The total number of vectors inserted across all nodes per
   iteration is unchanged. Recall measurement is unaffected because
   it indexes the dataset and ground truth, not the insert order (see the
   sharding sketch after this list).

2) docs/distributed-flow-diagrams.md (547 lines, 9 mermaid diagrams)
   covering: end-to-end build + insert + search flows, head-sync
   broadcast + ring-update handshake, per-process AddIndex bulk feed,
   the layer-0 vs layer-1 ownership matrix, MergePostings + the
   Bug 11 owner gate, and Split internals through QueueRemoteAppend
   (no synchronous flush in the critical section).
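
A small sketch of the two sharding modes from item 1, with placeholder names.

```cpp
#include <cstddef>
#include <vector>

// Illustrative: the default contiguous slice [n*B/N, (n+1)*B/N) versus the
// env-var gated stride mode that keeps rows where (rowIdx % numNodes) == nodeIndex.
std::vector<std::size_t> RowsForNode(std::size_t batchSize, std::size_t numNodes,
                                     std::size_t nodeIndex, bool strideSharding) {
    std::vector<std::size_t> rows;
    if (strideSharding) {
        for (std::size_t r = nodeIndex; r < batchSize; r += numNodes)
            rows.push_back(r);                       // interleaved rows
    } else {
        std::size_t begin = nodeIndex * batchSize / numNodes;
        std::size_t end = (nodeIndex + 1) * batchSize / numNodes;
        for (std::size_t r = begin; r < end; ++r)
            rows.push_back(r);                       // contiguous slice
    }
    return rows;
}
```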

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Raw FNV-1a on tiny nodeIndex (1, 2, 3) produces a pathologically
biased ring: with 2 nodes × 150 vnodes the empirical key split is
71.9% / 28.1% — explaining the long-observed driver/worker layer-0
imbalance (driver 1.55M layer-0 appends vs worker 95k after 290k/50k
bulk inserts in v19) where driver finishes ~5× faster than worker
even though both are given equal stride-sharded data.

Pre-mix nodeIndex through Knuth's golden-ratio multiplier (2654435761)
before XOR-ing it into the vnode component, so small node IDs become
full-spectrum uint32 values that distribute evenly through FNV.
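
A sketch of the mixing step, under the assumption that each vnode key combines (nodeIndex, vnodeIndex) through 32-bit FNV-1a; the exact key layout in the real ring may differ.

```cpp
#include <cstddef>
#include <cstdint>

// Standard 32-bit FNV-1a over a byte buffer.
inline std::uint32_t Fnv1a32(const void* data, std::size_t len) {
    const auto* bytes = static_cast<const unsigned char*>(data);
    std::uint32_t h = 2166136261u;               // FNV offset basis
    for (std::size_t i = 0; i < len; ++i) {
        h ^= bytes[i];
        h *= 16777619u;                           // FNV prime
    }
    return h;
}

// Pre-mix the tiny nodeIndex with Knuth's multiplicative constant so 1, 2, 3...
// become full-spectrum uint32 values before being folded into the vnode hash.
inline std::uint32_t VnodeRingPoint(std::uint32_t nodeIndex, std::uint32_t vnodeIndex) {
    std::uint32_t mixedNode = nodeIndex * 2654435761u;   // golden-ratio pre-mix
    std::uint32_t key[2] = { mixedNode, vnodeIndex };
    return Fnv1a32(key, sizeof(key));
}
```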

Validated:
  K=2 nodes (vnodes=150): 52.8% / 47.2%
  K=3 nodes (vnodes=150): 37 / 34 / 29 %
  K=4 nodes (vnodes=150): 25 / 24 / 22 / 29 %
  K=8 nodes (vnodes=150): 10.7 .. 14.3 %

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Routing-focused companion to distributed-flow-diagrams.md. Documents how
the four background jobs (Append, Split, Merge, Reassign) and the two
top-level operations (Search, Insert) reduce to a single primitive:
WorkerNode::GetOwner(headID) → ConsistentHashRing lookup.

Covers:
  - Per-job ownership rules (Split/Merge owner-local; Append/Reassign
    per-element routing; Search scatter-gather via FetchPostings).
  - Communication substrate (NetworkNode/DispatcherNode/RemotePostingOps)
    and which thread pools serve each RPC type.
  - Callback wiring (SetAppendCallback, SetHeadSyncCallback,
    SetRemoteLockCallback, SetFetchPostingsCallback) registered through
    ExtraDynamicSearcher::SetWorker.
  - End-to-end mermaid trace of one inserted vector triggering a split
    and a reassign across two nodes.
  - Bug 27 ring-skew fix and its empirical validation matrix (2/3/4/8 node).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address feedback: doc previously jumped into mechanics without
establishing motivation. Add §0 covering:

  0.1 What problem are we solving — three single-node walls
      (storage, insert throughput, search tail) and three explicit
      non-goals (no replication, no distributed transactions, no
      live re-sharding).
  0.2 Why shard by headID — comparison vs per-vector / per-query /
      per-bucket alternatives. Posting is the natural unit of
      read/write/storage; head index stays cheap to replicate.
  0.3 Why a consistent hash ring (not modulo) — minimal remap on
      bring-up, vnodes for small-N smoothing, sensitivity to vnode
      hash quality (Bug 27).
  0.4 Operations and jobs at a glance — moved the original short
      list of Search/Insert/Append/Split/Merge/Reassign here.

Existing §1..§8 unchanged in content; numbering preserved.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Worker side observed 3,873 "MergePostings: abandoning headID … after
64 retries" warnings during v20 BATCH 1 (39% of submitted merges
abandoned), driver side had zero. Root cause:

* The receive-side BatchAppend handler runs Append() with
  m_rwLocks[headID] held during a TiKV Get+Put round-trip
  (tens to hundreds of ms in distributed mode).
* MergePostings tries to acquire m_rwLocks[candidate] with a
  non-blocking try_lock and re-queues on failure.
* The merge re-queue cycle through m_splitThreadPool is much
  shorter than the Append hold time, so the candidate is
  essentially never observed unlocked within the 64-retry budget.
* Driver doesn't see this because it doesn't process incoming
  BatchAppend RPCs at the same volume — appends to driver-owned
  heads come from local SPFreshTest threads, which release the
  lock between the Get+Put RTTs of bulk inserts.

Fix: use try_lock_for(50ms) so a single attempt can absorb one
Append RTT instead of bouncing through the thread pool. Total
worst-case wall time per merge stays bounded at 64 * 50ms = 3.2s,
preserving the original anti-livelock property.
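
A one-liner sketch assuming the per-head lock is (or wraps) a std::timed_mutex; the real m_rwLocks type may differ.

```cpp
#include <chrono>
#include <mutex>

// Illustrative bounded-wait acquisition: one attempt waits up to 50ms (enough
// to absorb an Append's TiKV Get+Put RTT) instead of failing instantly and
// re-queuing through the thread pool; 64 retries bound the worst case at 3.2s.
bool TryLockCandidateForMerge(std::timed_mutex& headLock) {
    return headLock.try_lock_for(std::chrono::milliseconds(50));
}
```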

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The stride-shard env var was being read on the driver but the SSH
command line for workers wasn't forwarding it, so workers always saw
stride=0. Add explicit forwarding (defaulting to 0 when unset) for
both the build-receiver and run paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>