Add tests for StandardHtmlEncodingDetector content-encoding, EncodingResult fields, and markLimit by vasiliy-mikhailov · Pull Request #2917 · apache/tika

vasiliy-mikhailov · 2026-06-30T04:20:24Z

Adds four JUnit 5 tests to StandardHtmlEncodingDetectorTest covering the charset-from-content-encoding detection path, full EncodingResult field verification (charset, confidence, label, DECLARATIVE result type), the default markLimit (65536), and custom setMarkLimit behavior including a meta tag placed beyond the configured limit. The tests reuse the existing assertCharset/detectCharset helpers and exercise previously uncovered branches in the detector.

| metric | before | after |
| --- | --- | --- |
| mutation score | 80% | 93% |
| test methods added | n/a | 4 |

The additions are append-only (no existing test is modified) and pass against the current code.

How this was produced

This PR was generated with an AI-assisted pipeline built around mutation testing (PIT). The pipeline mutates the target class (flipping conditions and changing boundary/edge cases) and runs the existing tests against each mutant. Where a mutant survives (the existing tests do not catch that edge case), it writes a focused test for that case and reruns PIT to confirm the new test actually kills that specific mutant. So every added test is verified to catch a concrete edge case the suite missed before, rather than being speculative or redundant. The change is additive only (no production code modified), and the module builds green under its CI JDK.
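As a rough illustration of the loop described above, here is a minimal, self-contained sketch of a boundary mutant that survives a weak suite and is killed once a boundary case is added. Nothing here is Tika or PIT code: the `Rule` interface, the `offset < limit` rule, and both "suites" are hypothetical stand-ins chosen only to show the idea.

```java
// Toy illustration of a surviving mutant; all names here are hypothetical.
final class MutationSketch {

    /** Stand-in for a production rule: is a byte offset readable under a limit? */
    @FunctionalInterface
    interface Rule {
        boolean readable(int offset, int limit);
    }

    /** Assumed original behavior: offsets strictly below the limit are readable. */
    static final Rule ORIGINAL = (offset, limit) -> offset < limit;

    /** Boundary mutant of the kind a mutation tool produces: '<' becomes '<='. */
    static final Rule MUTANT = (offset, limit) -> offset <= limit;

    /** Weak suite: checks values far from the boundary only. */
    static boolean weakSuite(Rule rule) {
        return rule.readable(0, 100) && !rule.readable(200, 100);
    }

    /** Strengthened suite: adds the case exactly at the boundary. */
    static boolean strengthenedSuite(Rule rule) {
        return weakSuite(rule) && !rule.readable(100, 100);
    }

    public static void main(String[] args) {
        // A suite "kills" a mutant when it passes on the original but fails on the mutant.
        System.out.println("weak suite passes on original:         " + weakSuite(ORIGINAL));
        System.out.println("weak suite passes on mutant:           " + weakSuite(MUTANT));
        System.out.println("strengthened suite passes on original: " + strengthenedSuite(ORIGINAL));
        System.out.println("strengthened suite passes on mutant:   " + strengthenedSuite(MUTANT));
    }
}
```

In this sketch the mutant survives the weak suite because neither check touches `offset == limit`; the single added boundary check is what distinguishes the two rules.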

tballison · 2026-06-30T10:57:22Z

From Claude's review:

  1. The comments are mutation-tooling exhaust and will rot. Lines like "// kills the surviving mutants on lines 70-71 (EQUAL_ELSE + removed call...)", "// InlineConstant (1.0 -> 2.0) ... on lines 81-82", and "// line 94 (NO_COVERAGE)" describe why PIT generated the test and hard-code production line numbers. The moment StandardHtmlEncodingDetector.java shifts by a line, those comments are wrong. I'd ask the contributor to rewrite them to state the behavior under test (e.g. "charset comes from Content-Encoding when Content-Type is absent") and drop the mutant/line-number references entirely.
  2. customMarkLimit's comment is slightly off. It says "the meta tag beyond 100 bytes won't be found", but the tag actually starts at byte 80 (inside the limit); it's the charset value that gets truncated at byte 100. The test logic is correct; only the wording misleads.
  3. Minor/stylistic: assertEquals(1.0f, getConfidence()) has no delta. It compiles (JUnit 5 has the (float,float) overload) and passes because production hard-codes 1.0f, so it's fine, though a purist might add a delta.
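To make point 2 concrete, here is a small self-contained sketch of why a tag that starts inside the limit can still fail to yield a charset. It is not the detector's implementation and not the test from this PR: the regex scanner, the space padding, and the "windows-1252" value are assumptions chosen only to reproduce the byte-80/byte-100 layout described above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy scanner, not Tika code: it only looks at the first markLimit bytes.
final class MarkLimitSketch {

    // Requires the closing quote, so a value cut off mid-way does not match.
    private static final Pattern META_CHARSET =
            Pattern.compile("<meta\\s+charset=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Builds a page whose meta tag starts at the given byte offset (hypothetical layout). */
    static byte[] page(int tagOffset, String charset) {
        StringBuilder html = new StringBuilder();
        for (int i = 0; i < tagOffset; i++) {
            html.append(' ');
        }
        html.append("<meta charset=\"").append(charset).append("\">");
        return html.toString().getBytes(StandardCharsets.US_ASCII);
    }

    /** Returns the text visible to the scanner once the content is cut at markLimit bytes. */
    static String visibleHead(byte[] content, int markLimit) {
        byte[] head = Arrays.copyOf(content, Math.min(content.length, markLimit));
        return new String(head, StandardCharsets.US_ASCII);
    }

    /** Looks for a complete charset declaration within the first markLimit bytes. */
    static Optional<String> declaredCharset(byte[] content, int markLimit) {
        Matcher matcher = META_CHARSET.matcher(visibleHead(content, markLimit));
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        byte[] content = page(80, "windows-1252");
        // The tag opens at byte 80, inside the 100-byte limit, but the value is cut short.
        System.out.println("visible tail: " + visibleHead(content, 100).trim());
        System.out.println("limit 100:    " + declaredCharset(content, 100));
        System.out.println("limit 65536:  " + declaredCharset(content, 65536));
    }
}
```

Under these assumptions the opening of the tag is visible at a limit of 100, but the value is cut after "windo", so no complete declaration is found. A more accurate test comment would therefore describe truncation of the charset value rather than a tag lying beyond the limit.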

These make sense to me.

Thank you for opening this and improving our unit tests.

Add tests for StandardHtmlEncodingDetector content-encoding, Encoding…

0d8e339

…Result fields, and markLimit

vasiliy-mikhailov wants to merge 1 commit into apache:main from vasiliy-mikhailov:add-StandardHtmlEncodingDetector-tests
