[Feature] Add Distributed Posting Router for SPANN #448
Draft
TerrenceZhangX wants to merge 95 commits into users/qiazh/pre-merge-tikv-bugfix
Conversation
…c, benchmarks

Core routing (PostingRouter.h):
- Hash routing: GetOwner uses headID % NumNodes for deterministic assignment
- RemoteLock RPC for cross-node Merge serialization (try_lock + retry)
- BatchAppend, HeadSync, InsertBatch packet types and handlers
- TCP-based server/client for inter-node communication

ExtraDynamicSearcher.h integration:
- EnableRouter/AdoptRouter for index lifecycle management
- Split: BroadcastHeadSync after creating/deleting heads
- MergePostings: Cross-node lock for neighbor headID on a different node
- MergePostings: BroadcastHeadSync for deleted head after merge
- Reassign: Route Append to owner node + FlushRemoteAppends
- AddIndex: Route appends to owner node via QueueRemoteAppend
- SetHeadSyncCallback: Wire up HeadSync + RemoteLock callbacks

Infrastructure:
- IExtraSearcher/Index/VectorIndex: Add routing virtual method chain
- Options/ParameterDefinitionList: RouterEnabled, RouterLocalNodeIndex, RouterNodeAddrs, RouterNodeStores config params
- CMakeLists: Link Socket sources and Boost into SPTAGLibStatic
- Connection.cpp: Safe remote_endpoint() with error_code (no throw)
- Packet.h: Append, BatchAppend, InsertBatch, HeadSync, RemoteLock types
- SPFreshTest.cpp: ApplyRouterParams, FlushRemoteAppends, WorkerNode test
- Benchmark configs: 100k/1m/10m x 1/2/3 node
- run_scale_benchmarks.sh: Automated benchmark runner
- docker/tikv: TiKV cluster docker-compose + config

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
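A minimal sketch of the modulo ownership rule summarized above; the struct and member names are illustrative stand-ins, not the real PostingRouter API:

```cpp
#include <cstdint>

// Sketch of hash routing: ownership of a posting is derived from its head ID,
// so every node computes the same owner without any coordination.
struct PostingRouterSketch {
    std::uint32_t m_numNodes;   // total nodes in the cluster
    std::uint32_t m_localNode;  // index of this node

    std::uint32_t GetOwner(std::uint32_t headID) const {
        return headID % m_numNodes;             // deterministic assignment
    }

    bool IsLocal(std::uint32_t headID) const {
        return GetOwner(headID) == m_localNode; // append locally vs. queue a remote append
    }
};
```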
- Fix index dir creation: create parent dir only (not spann_index subdir) so the build code creates spann_index and the safety check passes
- Clear checkpoint after build phase so driver re-runs all insert batches
- Add VectorIndex::AdoptRouter to transfer router between batch clones instead of creating new TCP server per batch (port conflict fix)
- Fix ExtraDynamicSearcher::AdoptRouter to override IExtraSearcher interface

100k results (routing works, ~50/50 local/remote split):
- 1-node steady-state: 90.2 vps
- 2-node steady-state: 80.5 vps
- 3-node steady-state: 77.7 vps

No scaling at 100k due to small batch size (100 vectors) and shared TiKV.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
12 slides covering: problem statement, solution architecture, full flow comparison (single vs distributed), hash routing, append write path, split/merge/reassign routing, HeadSync broadcast, 3-node sequence diagram, design decisions, network protocol, config, and 100k results. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix ini thread params: replace NumThreads with NumSearchThreads/NumInsertThreads
- Add 100M benchmark ini files (1-node, 2-node, 3-node)
- Update data paths to /mnt/data_disk/sift1b in all ini files
- Add BENCHMARK_GUIDE_SCALE_COMPUTE.md (English)
- Add BENCHMARK_RESULTS_SCALE_COMPUTE.md with 100K/1M/10M results
- Update docker-compose and tikv.toml for 3-PD/3-TiKV cluster
- Update run_scale_benchmarks.sh with multi-scale orchestration
- Add .gitignore entries for generated benchmark artifacts
- Fix FullSearch routing for multi-node search (per-node build)
- Update 10M benchmark: insert throughput 2-node 1.65x, 3-node 1.98x
- Search latency 10M: 2-node -35%, 3-node -50% vs 1-node
- Near-linear insert scaling across all data sizes (100K, 1M, 10M)
- Update benchmark configs, test harness, and scale benchmark script
- Delete all 24 benchmark INI files from Test/
- Replace section 5 in BENCHMARK_GUIDE_SCALE_COMPUTE.md with complete deterministic generation rules (scale table, topology rules, template, Python generator script)
- INI files can be regenerated on demand via the guide
- Embed full docker-compose.yml and tikv.toml contents in BENCHMARK_GUIDE_SCALE_COMPUTE.md section 4.1
- Remove docker/tikv/ from git tracking (files stay on disk)
- Use <NVME_DIR> placeholder instead of absolute paths

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bug fixes:
- SPANNIndex.cpp: Remove redundant SortResult() in FullSearchCallback that corrupted remote search results (heapsort on already-sorted data)
- TestDataGenerator.cpp: Fix EvaluateRecall truth NN stride from 1 to K

Feature:
- SPFreshTest.cpp: Add BuildOnly parameter to skip insert batches

Benchmark results (Float32/dim64, 10M scale):
- 1-node: 93.8 vps, 2-node: 200.0 vps (2.13x), 3-node: 271.4 vps (2.89x)
- Recall stable within each config after double-sort fix

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Measure RPC round-trip time (sendTime → future.get()) and assign per-query latency for remote search results. Previously p50/min latency showed 0 for remote queries. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
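A minimal sketch of that measurement, with illustrative names (RemoteReply, MeasureRemoteBatch) rather than the actual SPFresh types: the send timestamp is taken before blocking on the reply future, and the elapsed time is charged to every query answered by the remote batch.

```cpp
#include <chrono>
#include <cstddef>
#include <future>
#include <vector>

struct RemoteReply { /* decoded remote search results (placeholder) */ };

// Returns the round-trip time in microseconds and assigns it to each query in the batch.
double MeasureRemoteBatch(std::future<RemoteReply>& reply,
                          std::vector<double>& queryLatencies,
                          std::size_t firstQuery, std::size_t count) {
    auto sendTime = std::chrono::steady_clock::now();
    RemoteReply r = reply.get();  // wait for the RPC response
    double rtt = std::chrono::duration<double, std::micro>(
                     std::chrono::steady_clock::now() - sendTime).count();
    for (std::size_t i = 0; i < count; ++i)
        queryLatencies[firstQuery + i] = rtt;  // previously these were left at 0
    return rtt;
}
```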
Revert TiKVVersionMap node-scoped prefix optimization (vc:{nodeIndex}:...)
that broke cross-node version check correctness. Version map now uses
shared namespace (vc:{layer}:...) so all nodes can read/write the same
version data. Node-scoped optimization deferred to future branch.
Also add explanatory comments for AdoptRouter and HeadSync broadcast.
Restore commented-out debug log lines.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
These were only used by the reverted node-scoped version map. LocalToGlobalVID is kept (used in insert path for VID uniqueness). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
InsertVectors now uses dual path: per-vector multi-threaded insert for single-node (original behavior), bulk AddIndex for router-enabled multi-node (amortizes RPC overhead via batched remote appends). Also add comment on AddIndex explaining caller-side shard partitioning and LocalToGlobalVID purpose. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restore the original vidIdx counter loop for post-heap version filtering instead of the candidateIndices array approach. Both are functionally equivalent but the original pattern is simpler. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Restore 'For quantized index, pass GetFeatureDim()' comment
- Remove else { func(); } branch that was not in the original code
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use ConvertToString(valueType) and INI parameters to build the perftest_* filenames, matching TestDataGenerator::RunLargeBatches convention. Moves filename construction out of the per-batch loop. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Document the 3-file INI pattern for multi-node benchmarks:
- _build.ini: Rebuild=true, no Router (build phase)
- _driver.ini: Rebuild=false, Router enabled (driver/n0)
- _n{i}.ini: worker nodes (n1, n2, ...)
Update Python generator and shell script to match. Remove the
sed-based approach of patching _n0.ini at runtime.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move Router parameters (RouterEnabled, RouterLocalNodeIndex, RouterNodeAddrs, RouterNodeStores) from DefineSSDParameter to DefineRouterParameter with its own [Router] section in:
- ParameterDefinitionList.h: new DefineRouterParameter macro
- Options.h: SetParameter/GetParameter handle 'Router' section
- SPANNIndex.cpp: SaveConfig outputs [Router] block
- SPFreshTest.cpp: read [Router] INI section, ApplyRouterParams uses 'Router' section
- Benchmark guide: updated INI template and Python generator

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Distributed scale results belong in BENCHMARK_RESULTS_SCALE_COMPUTE.md, not the single-node 10M results file. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
DefineRouterParameter was nested inside #ifdef DefineSSDParameter, causing router parameters to be silently ignored. Moved the #endif to the correct position so DefineRouterParameter is at top-level scope. Also removed a debug log line from the DefineRouterParameter macro in Options.h. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implement benchmark-level query distribution where each node independently searches its contiguous partition of the query set, coordinated by barrier files (same mechanism as insert distribution). This replaces the previous RPC-based approach, eliminating RPC overhead and serial head search.

Key changes in SPFreshTest.cpp:
- BenchmarkQueryPerformance: partition queries across nodes, use barrier files for synchronization, compute QPS = totalQueries / max(wallTime)
- WorkerNode: unified command loop handling both search and insert commands via shared index directory

Results (10M Float32, 200 queries, TopK=5):
- 1-node: 194 QPS baseline
- 2-node: 404 QPS (2.08x speedup, super-linear due to cache effects)
- 3-node: 488 QPS (2.52x speedup)
- Insert scaling: 1→2→3 node = 119→211→314 vec/s (1.77x, 2.64x)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
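A minimal sketch of the partitioning and throughput computation described above, with hypothetical helper names (PartitionQueries, ComputeQPS); the real logic lives in BenchmarkQueryPerformance:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct QueryRange { std::size_t begin, end; };

// Each node searches a contiguous slice [n*Q/N, (n+1)*Q/N) of the query set.
QueryRange PartitionQueries(std::size_t totalQueries, std::size_t numNodes, std::size_t nodeIndex) {
    return { nodeIndex * totalQueries / numNodes,
             (nodeIndex + 1) * totalQueries / numNodes };
}

// Aggregate throughput is gated by the slowest node, hence the max over wall times.
double ComputeQPS(std::size_t totalQueries, const std::vector<double>& wallTimeSeconds) {
    double slowest = *std::max_element(wallTimeSeconds.begin(), wallTimeSeconds.end());
    return static_cast<double>(totalQueries) / slowest;
}
```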
1. Fix remote lock leak in MergePostings (ExtraDynamicSearcher.h)
- RAII RemoteLockGuard ensures the remote lock is released on all exit
paths (continue/return/exception), preventing distributed deadlock
(a sketch follows after this message)
2. Fix buffer overflow in BatchRouteSearch (SPANNIndex.cpp)
- Validate response array sizes before accessing result vectors
- Fall back to local search on size mismatch
3. Fix missing send-failure callback in SendRemoteLock (PostingRouter.h)
- Add failure callback to complete the promise on send error,
matching the pattern used by all other SendPacket call sites
- Prevents 5-second stall on every send failure
4. Normalize atomic operation in SendRemoteLock (PostingRouter.h)
- Change m_nextResourceId++ to fetch_add(1) for consistency
5. Fix uninitialized workerTime in barrier coordination (SPFreshTest.cpp)
- Initialize workerTime and validate ifstream read
- Skip worker timing on parse failure instead of using garbage value
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
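The RAII guard from fix 1 can be pictured roughly as below; RemoteLockGuardSketch and its release callback are illustrative, while the real class lives in ExtraDynamicSearcher.h and drives the RemoteLock RPC:

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Releases the remote lock exactly once on every exit path of the enclosing scope.
class RemoteLockGuardSketch {
public:
    RemoteLockGuardSketch(std::uint32_t headID, std::function<void(std::uint32_t)> release)
        : m_headID(headID), m_release(std::move(release)), m_owned(true) {}

    ~RemoteLockGuardSketch() { Reset(); }   // fires on continue/return/exception

    void Reset() {                          // explicit early release, if wanted
        if (m_owned) { m_release(m_headID); m_owned = false; }
    }

    RemoteLockGuardSketch(const RemoteLockGuardSketch&) = delete;
    RemoteLockGuardSketch& operator=(const RemoteLockGuardSketch&) = delete;

private:
    std::uint32_t m_headID;
    std::function<void(std::uint32_t)> m_release;
    bool m_owned;
};
```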
Remove BatchRouteSearch, SetFullSearchCallback, GetSearchNodeCount, and all supporting RPC infrastructure (SearchPostingRequest/Response, FullSearchBatchRequest/Response structs, search callbacks, handler methods, and related member variables) from PostingRouter, SPANNIndex, Index, VectorIndex, ExtraDynamicSearcher, SPFresh, and Packet. Tests use barrier-based distributed search exclusively; the RPC-based search routing is dead code. Existing SearchRequest/SearchResponse packet types are preserved as they are used by the pre-existing Aggregator/Client/Server code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Critical fixes:
- Fix integer overflow in DecodeVectorSearchResponse (ExtraTiKVController.h). Prevents OOB reads from corrupt TiKV responses with large numResults.
- Fix m_mergeJobsInFlight counter underflow in MergePostings retry paths (ExtraDynamicSearcher.h). Add increment before re-enqueued MergeAsyncJob to match the unconditional decrement in exec().

High fixes:
- Add FlushRemoteAppends after Split reassignment (ExtraDynamicSearcher.h). Ensures queued remote appends are sent after CollectReAssign in Split().
- Fix data race on m_nodeAddrs in ConnectToPeer (PostingRouter.h). Snapshot address under m_connMutex before retry loop.
- Fix BroadcastHeadSync reading m_nodeAddrs without lock (PostingRouter.h). Snapshot node count under m_connMutex before iterating.

Medium fixes:
- Fix m_storeToNodes race in AddNode - move inside m_connMutex scope.
- Fix unvalidated entryCount in HandleHeadSyncRequest with buffer-end tracking to prevent overruns from corrupt packets.
- Add buffer-end tracking in BatchRemoteAppendRequest::Read to catch overruns during per-item deserialization.
- Make m_asyncStatus atomic to fix race between async jobs and Checkpoint. Use exchange() for atomic read-and-reset in Checkpoint.
- Make shared ErrorCode ret atomic in LoadIndex and WriteDownAllPostingToDB parallel loops.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
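A rough sketch of the buffer-end tracking applied in the packet Read paths above; the function names are illustrative, not the real Packet.h / TiKV-controller readers. The idea is that every field read is checked against the end of the received buffer before it is consumed, so a corrupt entryCount or numResults cannot drive reads past the packet:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reads one uint32 if and only if it fits before the buffer end.
inline bool SafeReadU32(const std::uint8_t*& cursor, const std::uint8_t* end, std::uint32_t& out) {
    if (end - cursor < static_cast<std::ptrdiff_t>(sizeof(std::uint32_t))) return false;
    std::memcpy(&out, cursor, sizeof(out));
    cursor += sizeof(out);
    return true;
}

// Deserializes a length-prefixed entry list, rejecting counts/lengths that overrun the packet.
inline bool ReadEntries(const std::uint8_t* buf, std::size_t len) {
    const std::uint8_t* cursor = buf;
    const std::uint8_t* end = buf + len;
    std::uint32_t entryCount = 0;
    if (!SafeReadU32(cursor, end, entryCount)) return false;
    for (std::uint32_t i = 0; i < entryCount; ++i) {
        std::uint32_t itemLen = 0;
        if (!SafeReadU32(cursor, end, itemLen)) return false;
        if (itemLen > static_cast<std::size_t>(end - cursor)) return false;  // corrupt length
        cursor += itemLen;  // skip / parse the item payload
    }
    return true;
}
```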
When SKIP_HEAD_BUILD=1 is set, the script checks if HeadIndex exists in the index directory. If found, uses RebuildSSDOnly=true to skip SelectHead+BuildHead phases (only rebuilds SSD postings). If HeadIndex is missing, falls back to full build automatically. This saves significant time on re-runs where only SSD postings need to be rebuilt. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When NOCACHE=1, builds at large scales (e.g. 100M) are impractical because each split read goes through cold TiKV. With BUILD_WITH_CACHE=1, the build phase keeps TiKV block cache, OS page cache, and VersionCache enabled. Then before the search/insert phase the script swaps TiKV to tikv_nocache.toml (restarting only the TiKV containers, preserving data and PD state) and drops OS caches, so the measured search/insert phase runs fully cold. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… workspace pool)
…ig override

- CollectReAssign: use MultiScanPostings instead of MultiGet in multi-chunk mode so nearby-heads scanning reads actual posting data (not just count key)
- SPFreshTest: read AppendThreadNum from [Benchmark] section and pass as ssdOverride so it overrides the hardcoded default (was using insertthread)
ExtraDynamicSearcher.h:
- Split cluster ks ordering, head vector preservation
- Multi-chunk MultiScanPostings in CollectReAssign
- Error code propagation through PutPostingToDB / AddHeadIndex paths

ExtraTiKVController.h:
- Retry + region_error handling for Delete / Set/GetPostingCount / PutChunk / PutBaseChunk / ScanPosting / DeleteRange paths
- Distinguish VectorNotFound from gRPC error
Add per-skip Warning log when a posting buffer contains a VID outside [0, m_versionMap->Count()). Helps diagnose data corruption / sentinel VIDs leaking into Reassign scan.
In CollectReAssign's nearby-heads scan (m_reassignK > 0), postings were read into p_exWorkSpace->m_pageBuffers and iterated via raw pointers into those buffers. tryBatchReassign -> RNGSelection -> SearchHeadIndex -> SearchDiskIndex -> searcher->SearchIndex(p_exWorkSpace, ...) reuses the SAME m_pageBuffers for its own MultiGet/MultiScanPostings, overwriting (or reallocating) the buffer the outer loop is mid-scan over. This produced "skip invalid VID" warnings whose tail bytes matched leftover/missing-key returns (0xFF/0xFE patterns). Fix: snapshot each nearby posting into a local std::string before iterating, mirroring the safe pattern already used by the postingLists loop above.
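A minimal sketch of that fix, with illustrative parameter types standing in for the workspace structures: each nearby posting is copied into a local std::string before iteration, so the nested SearchIndex call can reuse or reallocate the shared page buffers without invalidating the outer scan.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Snapshot-before-iterate pattern: the local copy stays valid even if the
// shared workspace buffer is overwritten by a nested head search.
void ScanNearbyPostingsSketch(const std::vector<const char*>& pageBuffers,
                              const std::vector<std::size_t>& postingSizes) {
    for (std::size_t i = 0; i < pageBuffers.size(); ++i) {
        std::string posting(pageBuffers[i], postingSizes[i]);  // private snapshot
        // ... iterate over posting.data() and issue nested head searches safely ...
    }
}
```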
…ix; eval 2026-04-23 batch 1-3 results

DIAG instrumentation (IExtraSearcher.h IndexStats):
- 5 atomic log2 histograms (22 buckets, 1us-1s+ / 1B-1MB+): AppendLockWait, AppendGetUs, AppendPutUs, AppendPostingBytes, SplitLockWait
- HistBucketOf / HistAdd / FormatHist helpers
- PrintStat unconditionally emits 5 [DIAG] lines

ExtraDynamicSearcher.h:
- Append single-chunk RMW path: time lock-wait, Get, Put; record bytes
- Split: time write-lock acquisition
- AllFinished ALL DONE branch: per-layer [DIAG] dump
- MergeAsyncJob re-queue: increment m_mergeJobsInFlight + m_totalMergeSubmitted (was missing, caused in-flight underflow + completed > submitted)

Eval (evaluation/2026-04-23):
- benchmark_spfresh_sift1b_v10_multichunk.ini (UseMultiChunkPosting=false, 16 insert threads)
- benchmark_multichunk_20260424_023309.log (live run incl. batch 1-3 DIAG output)
- output.json (per-batch search/insert metrics)
- ANALYSIS_BATCH1-3.md: client vs TiKV-server latency; concludes server is healthy (raw_get/put avg <1ms, 0 stalls, 78 regions); client-side overhead is the bottleneck (10-37x gap, growing per batch)
…traDynamicSearcher (PR 447 follow-up)
…sdr (NVMe-only per protocol)
…s by suffix

Upstream agent (eng-spfresh-worker) committed 4 sdr-host-specific config tweaks (paths to /mnt/nvme, TiKV memory to 40GB) plus refresh-reference.sh and run_perf_validation.sh helpers. Auto-merge took upstream's path/memory changes since our side had not modified these templates.

To keep both environments working without fighting on every push:
- *_sdr.ini / tikv_sdr.toml: upstream's sdr-host config (saved)
- *_template.ini / tikv.toml: restored to .9/.7 paths + 200GB cache

New files added:
- configs/benchmark_insert_dominant_template.ini: 10× expansion workload
- /cluster_2node.conf: .9 driver + .7 worker

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rge-distributed-to-tikv

Resolve conflicts preserving features from both sides:
- HEAD: distributed worker/dispatcher/router, AsyncAppend, MergeAsyncJob(3-arg), m_asyncStatus, headVec layout with metaDataSize prefix, NumSearchDuringInsertThreads
- qiazh: VersionReadPolicy, TiKV PutBaseChunkAndCount/AsyncGetPostingCounts batch RPCs, AddIndexAsync* (multi-chunk + single-key), Append RMW data-loss fix (distinguish Key_NotFound from errors), GetCachedPostingCount -1 sentinel, extensive DIAG stats and histograms (reassignJobLatency, queueDepth, splitNewHeadCount, MC histograms), batch barrier wait timing.

Build verification: Release/SPTAGTest builds cleanly with -DTIKV=ON -DTBB=ON. Added absl_synchronization to SPTAGTest link deps (qiazh's main.cpp uses absl::SetMutexDeadlockDetectionMode).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Restore evaluation/2026-04-23/* to qiazh's pre-merge-tikv-bugfix versions.
- Drop zhangt-only evaluation/2026-04-20-merge/ and evaluation/backend_comparison/.
- Trim evaluation/distributed/ to only the Insert dominant benchmark essentials (README.md, run_distributed.sh, configs/benchmark_insert_dominant_template.ini).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Template (benchmark_insert_dominant_template.ini):
- Align knobs with qiazh's benchmark_spfresh_sift1b_v10_multichunk.ini:
AppendThreadNum 48->144 (I/O-bound async append, ~3x cores for higher
in-flight TiKV RPC concurrency), add PostingPageLimit=8,
PostingCountCacheCapacity=1000000, DistributedVersionMap=true.
- Drop deprecated VersionCacheTTLMs (qiazh removed in this branch).
- Default data paths to /mnt/nvme/sift1b/{bigann_base.u8bin,query.10K.u8bin}
(matches our mounted RAID); TruthPath=truth (auto-generated by SPTAGTest).
- Add TiKVPDAddresses=PLACEHOLDER for multi-machine fill-in by run_distributed.sh.
README:
- Reframe around the multi-machine workflow (single-machine multi-process
scripts are no longer included).
- Adopt qiazh's TiKV operational structure: prerequisites, per-node start,
health check, optional pre-split & scatter (adapted to per-node independent
PDs since each machine runs its own TiKV instead of one shared 3-PD cluster),
key-knob table, and per-batch output JSON structure.
- Preserve our distributed-specific sections: architecture diagram, TiKV
deployment model (independent per-node, no cross-node Raft), dispatch
protocol, run_distributed.sh subcommands and env overrides, troubleshooting.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
In distributed mode, each worker process needs to connect to its own local PD/TiKV instance. The driver writes the per-process TiKVPDAddresses into ssdOverrides (indexed by WORKER_INDEX), but RunWorker was calling VectorIndex::LoadIndex(path, index) without applying those overrides — so the worker inherited the driver's PD address from the saved indexloader.ini and silently routed every KV write back to the driver's TiKV regardless of the consistent-hash routing decision. Fix: add a LoadIndex overload that takes a parameter overrides map and applies it between LoadIndexConfig and LoadIndexData, ensuring m_options reflects worker-local settings before PrepareDB constructs the TiKVIO connection. RunWorker now calls this overload with ssdOverrides. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
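A minimal sketch of that overload, under the assumption that the helper names below are illustrative stand-ins for the real VectorIndex API: the overrides (e.g. the worker-local TiKVPDAddresses) are applied after the saved config is parsed but before the storage/TiKV connection is opened.

```cpp
#include <map>
#include <string>

// Illustrative loader: overrides are applied between config parsing and data loading,
// so m_options reflects worker-local settings before PrepareDB connects to TiKV.
struct IndexLoaderSketch {
    bool LoadIndexConfig(const std::string&) { /* parse saved indexloader.ini (stub) */ return true; }
    bool LoadIndexData(const std::string&)   { /* open storage / TiKV connection (stub) */ return true; }
    void SetParameter(const std::string&, const std::string&, const std::string&) { /* stub */ }

    bool LoadIndexWithOverrides(const std::string& path,
                                const std::map<std::string, std::string>& ssdOverrides) {
        if (!LoadIndexConfig(path)) return false;
        for (const auto& kv : ssdOverrides)
            SetParameter("SSD", kv.first, kv.second);  // e.g. worker-local TiKVPDAddresses
        return LoadIndexData(path);                    // now connects to the local PD/TiKV
    }
};
```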
…hput fixes
Brings the `users/zhangt/merge-distributed-to-tikv` branch to a working
2-node sharded SPTAG-on-TiKV insert-dominant pipeline. Fixes 26 bugs
catalogued in `docs/distributed-bugs.md`; this is the consolidated landing
commit for review.
High-level architecture
-----------------------
- Dispatcher / Driver / Worker tri-process layout (1 dispatcher + N workers).
Driver is also worker 0; remote workers are 1..N-1.
- Routing: consistent-hash ring keyed by headID. Each posting has exactly
one owner node. Head index is replicated to all nodes via HeadSync.
- Build: split workers route postings to owner via PutPosting; receivers
write to local TiKV.
- Insert: AppendCallback splits/merges locally; cross-node forward via
BatchRemoteAppend; receiver runs the same callback.
- Search: scatter-gather. SearchIndex groups candidates by owner,
RemoteFetchPostings pulls non-local postings, results merged at driver
(see the sketch after this list).
- Per-node TiKV cluster + per-node `TiKVVersionMap` (no shared state).
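A rough sketch of the scatter-gather grouping step, with illustrative names; the owner lookup is abstracted behind an ownerOf callback standing in for the consistent-hash ring:

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <vector>

// Bucket candidate head IDs by owner node: local postings are read directly,
// and one RemoteFetchPostings batch is sent per remote owner before the
// driver merges the results.
std::map<std::uint32_t, std::vector<std::uint32_t>>
GroupCandidatesByOwner(const std::vector<std::uint32_t>& candidateHeads,
                       const std::function<std::uint32_t(std::uint32_t)>& ownerOf) {
    std::map<std::uint32_t, std::vector<std::uint32_t>> byOwner;
    for (std::uint32_t headID : candidateHeads)
        byOwner[ownerOf(headID)].push_back(headID);
    return byOwner;
}
```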
Key fixes in this drop (see docs for full per-bug detail)
---------------------------------------------------------
Build / resume:
Bug 1 InitWorkSpace dispatch restored (AddIndex segfault)
Bug 14 Resume mode: reset m_topIndex per layer
Bug 15 Resume mode: HeadIDFile path uses prior layer suffix
Bug 16 Resume mode: don't double-translate p_headToLocal
Distributed insert correctness:
Bug 2 Insert path actually dispatches to workers when router enabled
Bug 5 BindWorkerToAllLayers (layer 1+ now has m_worker)
Bug 11 GetOwner gate on Split / MergePostings / Reassign read paths
Bug 17 True sharded build (PutPosting) + sharded search (FetchPostings)
Bug 18 Reassign no longer inlines FlushRemoteAppends (cross-node deadlock)
Bug 19 AppendCallback unconditionally injects head record (HeadSync race)
Bug 21 MergePostings owner check moved outside rwLock-bucket equality
TiKV / version map:
Bug 3 LoadIndex multi-chunk dimensioning
Bug 4 AddIDCapacity multi-node growth = numNodes * delta
Bug 13 AddBatchInitialDeleted bulk path (avoids O(N) RPC storm)
Bug 22 Distributed version map convergence: SetVersionBatch on
AppendCallback receiver and SearchIndex post-fetch loop
Networking / RPC robustness:
Bug 6 SafeSimpleReadBuffer (no more out-of-bounds reads on bad packets)
Bug 7 m_callbackLifetimeMutex protects callback invocation vs destruction
Bug 8 SendBatchRemoteAppend chunks to avoid 60s RPC timeout
Bug 10 FlushRemoteAppends serialised + bumped server thread / connection
pools (no thundering-herd 143-RPC parse-failed storm)
Bug 12 Search scatter-gather (was: single-node candidates only) [via 17]
Bug 20 ClearCallbacksIfOwner with owner-token (multi-layer co-existence)
Throughput / livelock under sustained load:
Bug 23 HandleBatchAppendRequest fan-out cap 16 -> 4 (TiKV chunk-mutex)
Bug 24 MergePostings retry bound + remote-bucket explicit defer
Bug 25 SetVersionBatch numNodes-aware bound + AddBatch fills 0xff
Bug 26 RemotePostingOps background executor: batch handlers no longer
occupy network-pool threads, freeing them for FetchPostings /
HeadSync (search recall protected from insert head-of-line).
Files
-----
Core changes are in:
AnnService/inc/Core/SPANN/Distributed/ (new sharded RPC plane)
AnnService/inc/Core/SPANN/ExtraDynamicSearcher.h
AnnService/inc/Core/Common/{IVersionMap,TiKVVersionMap}.h
AnnService/inc/Socket/ (defensive parsing, larger pool)
Test/src/SPFreshTest.cpp (BindWorkerToAllLayers, dispatcher timeout)
Configs:
evaluation/distributed/configs/cluster_2node.conf (new, was at root)
evaluation/distributed/configs/benchmark_insert_dominant_2node.ini (new)
evaluation/distributed/configs/tikv.toml (new)
Docs:
docs/distributed-bugs.md (full per-bug write-up + validation table)
Validation status
-----------------
v17: Bug 23/25 confirmed (recall 0.9520 batch 0; 0 OOR rejections;
0 PD reconnects). Recall 0.228 in batch 1 traced to FetchPostings
60s timeouts caused by network-thread starvation -> Bug 26 fix.
v18: pending — verify Bug 26 brings recall back to 0.94.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pulls in 6 upstream commits:
- 93764e0 Optimize TiKV insert RPC and version map batching
- ac465eb Add AddIndex lock timing diagnostics
- fe0fe89 Optimize TiKV async posting RPCs
- 9408d96 Add layer0-only version map search option
- 98d7421 Refine TiKV version map search checks
- 87bb2f1 Refine TiKV version map cache handling

Conflicts resolved while preserving Bug 11/19/23-26 distributed invariants in ExtraDynamicSearcher.h.
- IVersionMap.h: kept BOTH our AddBatchInitialDeleted (used by distributed AddIDCapacity with capa*numNodes growth) AND upstream's AddBatch(num, deleted) overload.
- TiKVVersionMap.h: took upstream's CacheFresh/CacheTTL (87bb2f1) + striped refresh mutexes; kept our AddBatchInitialDeleted bulk-tombstone override; auto-merged AddBatch(num, deleted) override (93764e0).
- ExtraDynamicSearcher.h: kept ours for AddIDCapacity (distributed numNodes-aware growth + AddBatchInitialDeleted); auto-merged everything else (kMaxMergeRetries=64, m_remoteBucketLocked, m_worker->GetOwner gates, SetVersionBatch in AppendCallback / SearchIndex remote-fetch loop, upstream's bool p_checkVersionMap parameter, ShouldCheckVersionMapInSearch, fe0fe89's MultiGetWithStatus/MultiPutWithStatus optimization).

Build verified: spfresh + SPTAGTest both build clean against system protobuf 3.12.4 / gRPC.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two distributed-mode improvements bundled together:

1) SPFRESH_SHARD_STRIDE env-var gated stride / odd-even sharding in SPFreshTest.cpp (RunBenchmark + RunWorker). When the env var is set to '1'/'true', each node loads the FULL per-iteration batch and keeps only rows where (rowIdx % numNodes) == nodeIndex, instead of the default contiguous slice [n*B/N, (n+1)*B/N). Motivation: in v18 we observed driver=71 vs worker=2 layer-0 splits despite a uniform consistent-hash ring. The contiguous slicing of the dataset (which on SIFT-like data has spatial locality) likely lands similar vectors on the same node, packing a few RNG-selected layer-0 heads to overflow on the driver and not on the worker. Stride sharding interleaves the data and lets us A/B test whether the layer-0 skew (and its layer-1 cascade asymmetry) originates from data ordering. The total number of vectors inserted across all nodes per iteration is unchanged. Recall measurement is unaffected because it indexes the dataset and ground truth, not the insert order.

2) docs/distributed-flow-diagrams.md (547 lines, 9 mermaid diagrams) covering: end-to-end build + insert + search flows, head-sync broadcast + ring-update handshake, per-process AddIndex bulk feed, the layer-0 vs layer-1 ownership matrix, MergePostings + the Bug 11 owner gate, and Split internals through QueueRemoteAppend (no synchronous flush in the critical section).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
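A minimal sketch of the two sharding rules compared in (1); the function names are illustrative, not the SPFreshTest API:

```cpp
#include <cstddef>

// Stride sharding (SPFRESH_SHARD_STRIDE=1): every node scans the full batch
// and keeps the rows it owns by row index, interleaving the data.
bool OwnsRowStride(std::size_t rowIdx, std::size_t numNodes, std::size_t nodeIndex) {
    return rowIdx % numNodes == nodeIndex;
}

// Default: each node takes the contiguous slice [n*B/N, (n+1)*B/N).
bool OwnsRowContiguous(std::size_t rowIdx, std::size_t batchSize,
                       std::size_t numNodes, std::size_t nodeIndex) {
    return rowIdx >= nodeIndex * batchSize / numNodes &&
           rowIdx < (nodeIndex + 1) * batchSize / numNodes;
}
```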
Raw FNV-1a on tiny nodeIndex values (1, 2, 3) produces a pathologically biased ring: with 2 nodes × 150 vnodes the empirical key split is 71.9% / 28.1% — explaining the long-observed driver/worker layer-0 imbalance (driver 1.55M layer-0 appends vs worker 95k after 290k/50k bulk inserts in v19) where the driver finishes ~5× faster than the worker even though both are given equal stride-sharded data.

Pre-mix nodeIndex through Knuth's golden-ratio multiplier (2654435761) before XOR-ing it into the vnode component, so small node IDs become full-spectrum uint32 values that distribute evenly through FNV.

Validated:
- K=2 nodes (vnodes=150): 52.8% / 47.2%
- K=3 nodes (vnodes=150): 37 / 34 / 29 %
- K=4 nodes (vnodes=150): 25 / 24 / 22 / 29 %
- K=8 nodes (vnodes=150): 10.7 .. 14.3 %

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
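A sketch of the pre-mix idea, under an assumed (illustrative) vnode key layout; the exact way the mixed node index is folded into the ring key differs in the real ConsistentHashRing, but the golden-ratio multiplier and FNV-1a constants are as described above:

```cpp
#include <cstddef>
#include <cstdint>

// Standard 32-bit FNV-1a over raw bytes.
inline std::uint32_t Fnv1a32(const void* data, std::size_t len) {
    const auto* p = static_cast<const std::uint8_t*>(data);
    std::uint32_t h = 2166136261u;                                  // offset basis
    for (std::size_t i = 0; i < len; ++i) { h ^= p[i]; h *= 16777619u; }  // FNV prime
    return h;
}

// Place one virtual node on the ring: small node indices are pre-mixed with
// Knuth's golden-ratio multiplier so they become full-spectrum uint32 values
// before being hashed together with the vnode counter.
inline std::uint32_t VnodeRingPosition(std::uint32_t nodeIndex, std::uint32_t vnode) {
    std::uint32_t mixed = nodeIndex * 2654435761u;   // golden-ratio pre-mix
    std::uint32_t key[2] = { mixed, vnode };         // illustrative key layout
    return Fnv1a32(key, sizeof(key));
}
```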
Routing-focused companion to distributed-flow-diagrams.md. Documents how
the four background jobs (Append, Split, Merge, Reassign) and the two
top-level operations (Search, Insert) reduce to a single primitive:
WorkerNode::GetOwner(headID) → ConsistentHashRing lookup.
Covers:
- Per-job ownership rules (Split/Merge owner-local; Append/Reassign
per-element routing; Search scatter-gather via FetchPostings).
- Communication substrate (NetworkNode/DispatcherNode/RemotePostingOps)
and which thread pools serve each RPC type.
- Callback wiring (SetAppendCallback, SetHeadSyncCallback,
SetRemoteLockCallback, SetFetchPostingsCallback) registered through
ExtraDynamicSearcher::SetWorker.
- End-to-end mermaid trace of one inserted vector triggering a split
and a reassign across two nodes.
- Bug 27 ring-skew fix and its empirical validation matrix (2/3/4/8 node).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address feedback: doc previously jumped into mechanics without
establishing motivation. Add §0 covering:
0.1 What problem are we solving — three single-node walls
(storage, insert throughput, search tail) and three explicit
non-goals (no replication, no distributed transactions, no
live re-sharding).
0.2 Why shard by headID — comparison vs per-vector / per-query /
per-bucket alternatives. Posting is the natural unit of
read/write/storage; head index stays cheap to replicate.
0.3 Why a consistent hash ring (not modulo) — minimal remap on
bring-up, vnodes for small-N smoothing, sensitivity to vnode
hash quality (Bug 27).
0.4 Operations and jobs at a glance — moved the original short
list of Search/Insert/Append/Split/Merge/Reassign here.
Existing §1..§8 unchanged in content; numbering preserved.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Worker side observed 3,873 "MergePostings: abandoning headID … after 64 retries" warnings during v20 BATCH 1 (39% of submitted merges abandoned); the driver side had zero.

Root cause:
* The receive-side BatchAppend handler runs Append() with m_rwLocks[headID] held during a TiKV Get+Put round-trip (tens to hundreds of ms in distributed mode).
* MergePostings tries to acquire m_rwLocks[candidate] with a non-blocking try_lock and re-queues on failure.
* The merge re-queue cycle through m_splitThreadPool is much shorter than the Append hold time, so the candidate is essentially never observed unlocked within the 64-retry budget.
* The driver doesn't see this because it doesn't process incoming BatchAppend RPCs at the same volume — appends to driver-owned heads come from local SPFreshTest threads, which release the lock between the Get+Put RTTs of bulk inserts.

Fix: use try_lock_for(50ms) so a single attempt can absorb one Append RTT instead of bouncing through the thread pool. Total worst-case wall time per merge stays bounded at 64 * 50ms = 3.2s, preserving the original anti-livelock property.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
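A minimal sketch of the bounded timed-lock retry described in the fix; std::timed_mutex stands in for the real per-head rwLock bucket, and in the actual code the retries are re-queued through the split thread pool rather than looped inline:

```cpp
#include <chrono>
#include <mutex>

// One attempt waits up to 50ms (long enough to absorb a single Append Get+Put
// round-trip); the 64-attempt budget bounds worst-case wall time at 64 * 50ms = 3.2s.
bool TryMergeWithBoundedRetries(std::timed_mutex& headLock, int maxRetries = 64) {
    using namespace std::chrono_literals;
    for (int attempt = 0; attempt < maxRetries; ++attempt) {
        if (headLock.try_lock_for(50ms)) {
            // ... perform the merge under the lock ...
            headLock.unlock();
            return true;
        }
        // lock still held by an in-flight Append; retry (re-queue in the real code)
    }
    return false;  // abandon this merge candidate, preserving the anti-livelock bound
}
```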
The stride-shard env var was being read on the driver but the SSH command line for workers wasn't forwarding it, so workers always saw stride=0. Add explicit forwarding (defaulting to 0 when unset) for both the build-receiver and run paths. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>