Skip to content

fix: handle ujson surrogate parse exceptions in JSON imports#1046

Closed
He-Pin wants to merge 1 commit into
databricks:masterfrom
He-Pin:fix/json-import-surrogate
Closed

fix: handle ujson surrogate parse exceptions in JSON imports#1046
He-Pin wants to merge 1 commit into
databricks:masterfrom
He-Pin:fix/json-import-surrogate

Conversation

@He-Pin

@He-Pin He-Pin commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Motivation

Importing malformed Unicode surrogate escapes from .json files can make ujson throw a plain java.lang.Exception. That previously escaped as an sjsonnet internal error instead of falling back to normal Jsonnet parsing and reporting a parse error.

Modification

  • Catch ujson's known bare java.lang.Exception surrogate parse failures in the .json import fast path.
  • Keep the catch narrow by requiring the exact exception class plus known surrogate error message prefixes.
  • Add regression coverage for low-surrogate, reversed-surrogate, high-before-non-low-escape, high-before-high-escape failures, and valid surrogate pairs.

Result

Case go-jsonnet v0.22.0 jrsonnet 0.5.0-pre99 sjsonnet before sjsonnet after
Import JSON containing \uDC00 as a lone low surrogate Truncated unicode surrogate pair escape sequence in string literal syntax error: invalid string escape Internal Error caused by Un-paired low surrogate sjsonnet.ParseError without Internal Error
Import JSON containing \uDE00\uD83D as a reversed surrogate pair parse error syntax error: invalid string escape Internal Error caused by Un-paired low surrogate sjsonnet.ParseError without Internal Error

Valid surrogate pairs still use the JSON import fast path.

Risks

  • The catch depends on ujson's current exception type and message prefixes for surrogate failures.
  • This does not add a separate pre-validation scan for every JSON import.

@He-Pin He-Pin force-pushed the fix/json-import-surrogate branch from 22fe76f to 7ca9536 Compare June 25, 2026 14:58
@He-Pin He-Pin marked this pull request as ready for review June 25, 2026 15:13
@He-Pin He-Pin marked this pull request as draft June 25, 2026 15:13
He-Pin added a commit to He-Pin/sjsonnet that referenced this pull request Jun 25, 2026
Motivation:
JSON imports and std.parseJson relied on ujson behavior for Unicode surrogate escapes. ujson can throw plain Exception for some malformed escapes and silently drop a lone high surrogate, so catching an exception message only fixed one symptom.

Modification:
Add a JsonUnicodeValidator that scans JSON string contents for invalid \uXXXX surrogate sequences before invoking ujson. Reuse it for .json import fast path and std.parseJson while preserving import fallback to normal Jsonnet parsing. Add regression coverage for lone high/low surrogates, reversed pairs, valid pairs, and escaped backslash literals.

Result:
Malformed surrogate escapes no longer surface as Internal Error or corrupt imported strings. Valid pairs and literal \u text continue to parse correctly across JVM and JS targeted tests.

References: databricks#1046
@He-Pin He-Pin force-pushed the fix/json-import-surrogate branch from 7ca9536 to 85788f3 Compare June 25, 2026 15:31
@He-Pin He-Pin changed the title fix: JSON import handles unpaired surrogate escapes without crashing fix: validate JSON surrogate escapes before parsing Jun 25, 2026
He-Pin added a commit to He-Pin/sjsonnet that referenced this pull request Jun 25, 2026
Motivation:
Some malformed Unicode surrogate escapes in imported .json files make ujson throw a plain java.lang.Exception, which previously escaped as an Internal Error. Avoid adding a pre-parse scan on the normal import path.

Modification:
Extend the .json import fast-path fallback to recognize ujson's known surrogate parse exception messages and fall back to normal Jsonnet parsing. Keep validation delegated to ujson and do not change std.parseJson.

Result:
Malformed surrogate escapes that ujson reports no longer leak Internal Error stack traces. Valid surrogate pairs still parse on the JSON fast path. Cases ujson silently accepts or corrupts remain a parser limitation rather than being fixed with an extra scan.

References: databricks#1046
@He-Pin He-Pin force-pushed the fix/json-import-surrogate branch from 85788f3 to 15a8173 Compare June 25, 2026 15:52
@He-Pin He-Pin changed the title fix: validate JSON surrogate escapes before parsing fix: handle ujson surrogate parse exceptions in JSON imports Jun 25, 2026
Comment thread sjsonnet/src/sjsonnet/Importer.scala
@He-Pin He-Pin marked this pull request as ready for review June 25, 2026 15:55
@He-Pin He-Pin closed this Jun 25, 2026
@He-Pin He-Pin reopened this Jun 25, 2026
@He-Pin He-Pin force-pushed the fix/json-import-surrogate branch 3 times, most recently from 1433a90 to e8016b8 Compare June 28, 2026 21:47
@He-Pin He-Pin force-pushed the fix/json-import-surrogate branch from e8016b8 to b557485 Compare June 28, 2026 22:16
@He-Pin He-Pin marked this pull request as draft June 28, 2026 22:43
@He-Pin He-Pin closed this Jun 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant