fix: speed up splitLimitR long delimiter search #1063

Draft: He-Pin wants to merge 2 commits into databricks:master from He-Pin:hepin/perf-split-limit-r-kmp
@He-Pin (Contributor) commented Jun 28, 2026

Motivation

std.splitLimitR can spend too much time searching for long multi-character delimiters with repeated prefixes. On no-match inputs, such as a long string of "a" characters searched for the delimiter "aaaaaaaaab", the search re-checks the same characters at every candidate position.
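To make the cost concrete, here is a minimal Python sketch of a naive right-to-left search that counts character comparisons. It is an illustration only, not the sjsonnet code; the function name `naive_rfind_with_count` and the input size of 1000 are assumptions chosen for the example.

```python
from typing import Tuple


def naive_rfind_with_count(text: str, delim: str) -> Tuple[int, int]:
    """Return (start index of the last match or -1, character comparisons made)."""
    n, m = len(text), len(delim)
    comparisons = 0
    # Try every candidate start position, from rightmost to leftmost.
    for start in range(n - m, -1, -1):
        j = 0
        while j < m:
            comparisons += 1
            if text[start + j] != delim[j]:
                break
            j += 1
        if j == m:
            return start, comparisons
    return -1, comparisons


# Pathological no-match input: every candidate matches 9 characters, then fails on the 10th.
text = "a" * 1000
delim = "a" * 9 + "b"
print(naive_rfind_with_count(text, delim))  # prints: (-1, 9910)
```

With N = 1000 and M = 10 there are N - M + 1 = 991 candidate positions and each costs M comparisons, which is the O(N*M) behaviour described above.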

Modification

  • Keep the existing direct search path for common short delimiter cases.
  • Add a bounded direct-probe phase for multi-character delimiters.
  • Fall back to reverse KMP for worst-case long-delimiter searches.
  • Add edge-case coverage and a regression bench fixture for the long-delimiter no-match case.
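The two-phase idea in the list above can be sketched in Python as follows. This is a hedged illustration, not the PR's implementation: the names `build_failure`, `kmp_rfind`, `rfind_hybrid`, and the `PROBE_BUDGET` value of 64 are hypothetical, and the budget here counts probes rather than whatever unit the real code bounds.

```python
from typing import List, Optional

PROBE_BUDGET = 64  # hypothetical limit on direct probes before switching to KMP


def build_failure(pattern: str) -> List[int]:
    """Standard KMP failure table: longest proper prefix that is also a suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_rfind(text: str, delim: str, end: int) -> int:
    """Start of the last occurrence of delim inside text[:end], or -1.

    Runs KMP with the reversed delimiter while walking text right to left,
    so each text character is visited once.
    """
    rp = delim[::-1]
    fail = build_failure(rp)
    k = 0
    for i in range(end - 1, -1, -1):
        while k > 0 and text[i] != rp[k]:
            k = fail[k - 1]
        if text[i] == rp[k]:
            k += 1
        if k == len(rp):
            return i  # text[i:i + len(delim)] == delim
    return -1


def rfind_hybrid(text: str, delim: str, end: Optional[int] = None) -> int:
    """Last occurrence of a non-empty delim in text[:end]: bounded probes, then KMP."""
    if end is None:
        end = len(text)
    m = len(delim)
    if m == 0:
        raise ValueError("empty delimiter")
    if m > end:
        return -1
    budget = PROBE_BUDGET
    start = end - m
    # Phase 1: cheap direct probes, which win when a match is close to the right edge.
    while start >= 0 and budget > 0:
        budget -= 1
        if text.startswith(delim, start):
            return start
        start -= 1
    if start < 0:
        return -1
    # Phase 2: budget exhausted, so finish the remaining prefix with reverse KMP.
    return kmp_rfind(text, delim, start + m)


print(rfind_hybrid("a" * 1000, "a" * 9 + "b"))  # prints: -1
```

In this sketch the probe phase costs at most PROBE_BUDGET * M comparisons, and the fallback adds O(N + M), so the pathological input no longer multiplies N by M.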

Result

| Case | go-jsonnet v0.22.0 | jrsonnet 0.5.0-pre99 | sjsonnet before | sjsonnet after |
| --- | --- | --- | --- | --- |
| std.length(std.splitLimitR(importstr "long-a.txt", "aaaaaaaaab", 1)) | 1 | 1 | 1, slower repeated probing | 1, reverse-KMP fallback |
| Long no-match delimiter search | correct result | correct result | worst-case approaches O(N*M) per lookup | worst-case O(N+M) per lookup |

The PR keeps the observable split result unchanged and improves the pathological search path.

Risks

  • The new fallback adds algorithmic complexity to std.splitLimitR.
  • Unusual delimiter patterns should remain covered by the direct-probe path and regression fixture.

Motivation:
std.splitLimitR scanned backward with repeated prefix comparisons, so long delimiters with shared prefixes could degrade to O(N*M) per lookup and risk a performance regression relative to release 0.6.3.

Modification:
Add a bounded direct-probe fast path plus reverse-KMP fallback for multi-character right splits, keep the zero-split fast path O(1), and add regression coverage/bench data.

Result:
The new split_limit_r_long_delim JMH case improves from 12.636 ms/op on master and 5.087 ms/op on 0.6.3 to 1.522 ms/op. The guardrail reverse-delimiter case stays at parity with master.
@He-Pin He-Pin force-pushed the hepin/perf-split-limit-r-kmp branch from 96c032e to 695f808 Compare June 28, 2026 21:48
@He-Pin He-Pin marked this pull request as draft June 28, 2026 22:20
@He-Pin He-Pin marked this pull request as ready for review June 28, 2026 22:24
@He-Pin He-Pin marked this pull request as draft June 28, 2026 22:45
Motivation:

std.splitLimitR should avoid unnecessary reverse-KMP setup when a delimiter cannot fit, and the optimized path needs direct coverage for the KMP fallback.

Modification:

Return the original string before bounded/KMP setup when the delimiter is longer than the input. Add splitLimitR tests for the long-delimiter case, the unbounded maxsplits path, and a case that forces the KMP fallback after naive probes.

Result:

Preserves splitLimitR semantics while removing avoidable work and pinning the optimized fallback paths with tests.
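The guard ordering described in this commit can be sketched in Python as below. It is a minimal model under stated assumptions, not the sjsonnet code: the function name `split_limit_r_guarded` is hypothetical, `str.rfind` stands in for the bounded-probe and KMP search, and a negative `maxsplits` is treated as unbounded and resolved right to left, which may differ from how the real unbounded path handles overlapping delimiters.

```python
from typing import List


def split_limit_r_guarded(text: str, delim: str, maxsplits: int) -> List[str]:
    """Split text on delim from the right, at most maxsplits times."""
    if not delim:
        raise ValueError("empty delimiter")
    # Zero-split fast path: O(1), no search at all.
    if maxsplits == 0:
        return [text]
    # A delimiter longer than the input cannot match, so return before
    # doing any search setup (for example, building a KMP table).
    if len(delim) > len(text):
        return [text]
    parts = []
    end = len(text)
    remaining = maxsplits  # negative values never reach 0, so they are unbounded here
    while remaining != 0:
        idx = text.rfind(delim, 0, end)  # stand-in for the optimized search
        if idx < 0:
            break
        parts.append(text[idx + len(delim):end])
        end = idx
        remaining -= 1
    parts.append(text[:end])
    parts.reverse()
    return parts


print(split_limit_r_guarded("/_foo/_bar", "/_", 1))  # prints: ['/_foo', 'bar']
print(split_limit_r_guarded("ab", "abc", 1))         # prints: ['ab']
```

Placing both guards ahead of the search keeps the trivial cases free of any table allocation, which is the avoidable work this commit removes.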