fix(client): harden SFU reconnection and ICE-restart recovery by oliverlaz · Pull Request #2285 · GetStream/stream-video-js

oliverlaz · 2026-06-15T10:41:45Z

💡 Overview

Three reconnection-robustness changes for the core client.

1. Signal-close revival hardening (StreamSfuClient, Call). A wedged signal WebSocket can be detected as dead by two independent sources (the health watchdog and the transport close event), which on a stuck socket can arrive seconds apart. Revival is now driven by a one-shot latch so it fires at most once per client, triggers immediately on an unhealthy close instead of waiting for the (possibly long-delayed) close event, and ignores closes from a superseded client so a stale socket cannot drive a reconnect after a new client has swapped in.

2. Subscriber negotiation-failure recovery (Subscriber). Subscriber.negotiate reset isIceRestarting only as the last statement of its happy path, so a negotiation failure left the flag stuck at true and disabled all future automatic ICE-restart / reconnect handling on the subscriber. Negotiation is now wrapped in try/catch/finally so the flag always clears; on failure the remote description is rolled back when the peer connection is still awaiting an answer (have-remote-offer), the rollback is itself guarded so it can never mask the original error, the failure is traced, and the subscriber escalates via tryRestartIce so a failed restart falls through to a reconnect instead of being silently dropped.

3. Revert the stability-gated publisher fast-reconnect skip (Call, BasePeerConnection). main skips the publisher's ICE restart on a fast reconnect when the publisher PC still looks stable (isStable()), to avoid SDP/ICE churn on signal-WS-only drops. In practice that skip can strand a cratered send-side bandwidth estimate: a "stable" publisher PC never restarts ICE, so the network-route-change BWE reset never fires and the encoder stays bitrate-starved after a handover. This restores the earlier behavior (on a fast reconnect the publisher's ICE is restarted unconditionally whenever it is publishing) and removes the now-unused isStable() helper.

📝 Implementation notes

StreamSfuClient: handleWebSocketClose becomes notifySignalClose, guarded by a signalClosed one-shot latch; close() notifies on a non-clean close without waiting for the transport event.
Call.handleSfuSignalClose ignores closes from any client that is no longer the active sfuClient.
Call fast reconnect: back to restoreICE(sfuClient, { includeSubscriber: false }) (publisher restart governed by includePublisher's true default); the isStable()-gated skip is gone.
Subscriber.negotiate wrapped in try/catch/finally; guarded rollback on have-remote-offer; subscriber.negotiationFailed trace; tryRestartIce escalation on failure.
BasePeerConnection.isStable() removed (no remaining callers).
Tests: StreamSfuClient revival, Call superseded-client guard, Subscriber success / failure / rollback / rollback-failure branches; removed the isStable() and FAST-skip tests.

🎫 Ticket: https://linear.app/stream/issue/XYZ-123

📑 Docs: https://github.com/GetStream/docs-content/pull/

Add CandidatePairMonitor, which watches the publisher's BUNDLE-shared ICE transport for selected-candidate-pair changes (e.g. WiFi to LTE) and triggers an ICE restart so the SFU learns the new transport path. A settle window, opened on startup and on every restartIce(), absorbs the restart-induced candidate-pair churn so reconnect-driven restarts (Call.restoreICE) do not loop back into another restart. Covered by unit tests for the monitor and Publisher-level reconnect-interplay tests.

changeset-bot · 2026-06-15T10:41:48Z

⚠️ No Changeset found

Latest commit: aed0593

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

coderabbitai · 2026-06-15T10:42:02Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

BasePeerConnection.isStable() is removed and replaced with isHealthy() (fails only on failed/closed states). Subscriber.negotiate gains try/catch/finally for SDP rollback and guaranteed isIceRestarting cleanup. StreamSfuClient introduces a one-shot notifySignalClose latch. Call.handleSfuSignalClose gains a superseded-client guard, and the FAST-reconnect ICE restore path is simplified.

Changes

ICE Resilience and Signal-Close Hardening

Layer / File(s)	Summary
BasePeerConnection: isStable removed, isHealthy added `packages/client/src/rtc/BasePeerConnection.ts`, `packages/client/src/rtc/__tests__/Publisher.test.ts`, `packages/client/src/rtc/__tests__/Subscriber.test.ts`	`isStable()` (required ICE connected/completed + PC connected) is deleted and replaced by `isHealthy()` which returns `false` only when ICE or the PC is in a `failed`/`closed` state. Corresponding `isStable` test cases are removed from Publisher and Subscriber test files.
Subscriber negotiate: SDP rollback and finally cleanup `packages/client/src/rtc/Subscriber.ts`, `packages/client/src/rtc/__tests__/Subscriber.test.ts`	`negotiate` is refactored with `try/catch/finally`: rolls back via `setRemoteDescription({ type: 'rollback' })` when `signalingState` is `have-remote-offer`, rethrows the original error, and always clears `isIceRestarting` in `finally`. A tracer event `subscriber.negotiationFailed` is emitted on `subscriberOffer` failure. A new `Subscriber negotiation` test suite covers all rollback and cleanup paths.
StreamSfuClient: one-shot notifySignalClose latch `packages/client/src/StreamSfuClient.ts`, `packages/client/src/__tests__/StreamSfuClient.test.ts`	Adds a private `signalClosed` latch and `notifySignalClose(reason)` that invokes `onSignalClose` at most once, clears timers, and is routed through both the WebSocket `close` handler and `close()`. Tests add an `emit` helper to `CapturingWebSocket`, wire `onSignalClose` into `buildSfuClient`, and verify deduplication across unhealthy-close and late transport-close scenarios.
Call: superseded-client guard, FAST reconnect, and tests `packages/client/src/Call.ts`, `packages/client/src/rtc/__tests__/Call.reconnect.test.ts`	`handleSfuSignalClose` gains an early return when the closing client is not the active `this.sfuClient`. The FAST-reconnect path simplifies to `restoreICE({ includeSubscriber: false })`, dropping the prior `publisherIsStable` conditional. `reconnect`'s skip-states guard reads `callingState` into a local variable. New tests verify superseded-client guard behavior; Firefox publisher tests add `degradationPreference: DegradationPreference.UNSPECIFIED` to match updated payloads.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

GetStream/stream-video-js#2237: Shares the Firefox Publisher changePublishQuality path where degradationPreference: DegradationPreference.UNSPECIFIED is now added to test payloads.
GetStream/stream-video-js#2241: Both PRs touch Publisher.test.ts changePublishQuality SFU event payloads with degradationPreference values.
GetStream/stream-video-js#2257: Both PRs modify Call.ts reconnect behavior by adjusting early-return guards and skip-state checks around CallingState.

Suggested reviewers

jdimovska

Poem

🐇 Hop, hop — the signal's closed, but only once I say!
No double-rings from sockets gone and superseded today.
The isHealthy check: just "failed" or "closed" shall fail,
And negotiate rolls back the offer if we bail.
In finally, the flag is cleared — come sun or stormy weather,
This rabbit wired resilience, now holding ICE together! 🧊

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title precisely captures the PR's main objective: hardening SFU reconnection and ICE-restart recovery through signal-close revival, subscriber negotiation fixes, and publisher ICE-restart behavior changes.
Description check	✅ Passed	The PR description comprehensively covers all required sections with clear detail on the three main changes, implementation approach, and ticket/docs references.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/ice-candidate-pair-monitor

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…ignal-close revival Replace the standalone CandidatePairMonitor with a reconnect-time check in BasePeerConnection/Publisher: snapshot the selected candidate pair when ICE drops to `disconnected`, and on recovery keep the scheduled ICE restart only if the path actually migrated (e.g. Wi-Fi to LTE). A same-path recovery cancels the restart, avoiding restart loops. Also harden signal-close revival in StreamSfuClient and Call: - add a one-shot latch so onSignalClose fires at most once per client, even when both the health watchdog and a delayed WebSocket close detect death - drive revival immediately on an unhealthy close instead of waiting for the transport close event, which a wedged socket may delay for seconds - ignore signal closes from a superseded client so a stale socket's late close cannot drive a reconnect after a new client has swapped in

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

packages/client/src/rtc/BasePeerConnection.ts (1)

477-482: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clear stale timeout state after the callback runs to prevent duplicate restarts.

At Line 477, the timeout handle is never reset when the callback fires. Later, the recovery path at Line 499 can still treat that stale handle as “scheduled” and call shouldKeepScheduledIceRestart()/tryRestartIce() again after a restart already ran.

Suggested fix

         clearTimeout(this.iceRestartTimeout);
         this.iceRestartTimeout = undefined;
         this.iceRestartTimeout = setTimeout(() => {
+          this.iceRestartTimeout = undefined;
           const currentState = this.pc.iceConnectionState;
           if (currentState === 'disconnected' || currentState === 'failed') {
             this.tryRestartIce();
           }
         }, this.iceRestartDelay);

Also applies to: 499-505

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/client/src/rtc/BasePeerConnection.ts` around lines 477 - 482, The
iceRestartTimeout handle is assigned at line 477 but never cleared after the
timeout callback executes, causing the stale handle to appear "scheduled" when
checked later at lines 499-505 in the recovery path. This can trigger duplicate
ICE restart attempts. Fix this by resetting this.iceRestartTimeout to null after
the callback logic completes inside the setTimeout body, ensuring that
subsequent calls to shouldKeepScheduledIceRestart() or tryRestartIce() recognize
that no timeout is currently active. This same cleanup should be applied
wherever iceRestartTimeout callbacks are used to prevent similar stale-state
issues.

🧹 Nitpick comments (1)

packages/client/src/rtc/__tests__/Publisher.test.ts (1)
699-742: ⚡ Quick win

Add one regression test for “timer already fired, then connected” behavior.

The new suite is strong, but it doesn’t cover the case where the scheduled restart has already fired and ICE later transitions to connected/completed. Adding that case would prevent duplicate-restart regressions in this reconnect path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/client/src/rtc/__tests__/Publisher.test.ts` around lines 699 - 742,
Add a new test case after the existing three tests in the Publisher.test.ts
suite that verifies the "timer already fired, then connected" behavior. The test
should use vi.useFakeTimers(), call publishVideo() to capture the ICE transport,
set up a spy on the restartIce method, transition ICE from connected to
disconnected (which schedules a restart), advance timers to allow the scheduled
restart to execute, and then transition ICE back to connected or completed.
Assert that the restartIce method was called exactly once (from the scheduled
timer) and not again when ICE transitions to connected, ensuring no
duplicate-restart regression occurs in the reconnect path.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@packages/client/src/rtc/BasePeerConnection.ts`:
- Around line 477-482: The iceRestartTimeout handle is assigned at line 477 but
never cleared after the timeout callback executes, causing the stale handle to
appear "scheduled" when checked later at lines 499-505 in the recovery path.
This can trigger duplicate ICE restart attempts. Fix this by resetting
this.iceRestartTimeout to null after the callback logic completes inside the
setTimeout body, ensuring that subsequent calls to
shouldKeepScheduledIceRestart() or tryRestartIce() recognize that no timeout is
currently active. This same cleanup should be applied wherever iceRestartTimeout
callbacks are used to prevent similar stale-state issues.

---

Nitpick comments:
In `@packages/client/src/rtc/__tests__/Publisher.test.ts`:
- Around line 699-742: Add a new test case after the existing three tests in the
Publisher.test.ts suite that verifies the "timer already fired, then connected"
behavior. The test should use vi.useFakeTimers(), call publishVideo() to capture
the ICE transport, set up a spy on the restartIce method, transition ICE from
connected to disconnected (which schedules a restart), advance timers to allow
the scheduled restart to execute, and then transition ICE back to connected or
completed. Assert that the restartIce method was called exactly once (from the
scheduled timer) and not again when ICE transitions to connected, ensuring no
duplicate-restart regression occurs in the reconnect path.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 91e76865-c64d-4dbb-88e3-72d1cb60aa47

📥 Commits

Reviewing files that changed from the base of the PR and between db5718e and fba9660.

📒 Files selected for processing (8)

packages/client/src/Call.ts
packages/client/src/StreamSfuClient.ts
packages/client/src/__tests__/StreamSfuClient.test.ts
packages/client/src/rtc/BasePeerConnection.ts
packages/client/src/rtc/Publisher.ts
packages/client/src/rtc/__tests__/Call.reconnect.test.ts
packages/client/src/rtc/__tests__/Publisher.test.ts
packages/client/src/rtc/__tests__/mocks/webrtc.mocks.ts

🚧 Files skipped from review as they are similar to previous changes (1)

packages/client/src/rtc/tests/mocks/webrtc.mocks.ts

Wrap Subscriber.negotiate in try/catch/finally so isIceRestarting is always cleared, even when negotiation throws. Previously the reset was the last statement of the happy path and was skipped on failure, leaving the flag stuck at true and disabling all future automatic ICE restart/reconnect handling on the subscriber. On failure, roll back the remote description when the peer connection is still awaiting an answer (have-remote-offer), and trace the negotiation failure. Add tests covering the success, failure, and rollback branches.

coderabbitai

🧹 Nitpick comments (1)

packages/client/src/rtc/Subscriber.ts (1)

204-211: 💤 Low value

Rollback failure could mask the original negotiation error.

If setRemoteDescription({ type: 'rollback' }) throws, the original err is never rethrown and its context is lost. Consider wrapping the rollback in a nested try/catch to preserve the original error for debugging:

Proposed fix to preserve original error

     } catch (err) {
       if (this.pc.signalingState === 'have-remote-offer') {
-        await this.pc.setRemoteDescription({ type: 'rollback' });
+        try {
+          await this.pc.setRemoteDescription({ type: 'rollback' });
+        } catch (rollbackErr) {
+          this.logger.warn('SDP rollback failed after negotiation error', rollbackErr);
+        }
       }
       throw err;
     } finally {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/client/src/rtc/Subscriber.ts` around lines 204 - 211, The catch
block in Subscriber.ts where setRemoteDescription with rollback is called can
mask the original negotiation error if the rollback itself throws an exception.
Wrap the setRemoteDescription({ type: 'rollback' }) call in a nested try/catch
block to handle any rollback failure separately, then ensure the original err is
still thrown after the finally block executes. This way, if the rollback fails,
its error can be logged or handled without losing the context of the original
negotiation error that triggered the rollback attempt.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@packages/client/src/rtc/Subscriber.ts`:
- Around line 204-211: The catch block in Subscriber.ts where
setRemoteDescription with rollback is called can mask the original negotiation
error if the rollback itself throws an exception. Wrap the
setRemoteDescription({ type: 'rollback' }) call in a nested try/catch block to
handle any rollback failure separately, then ensure the original err is still
thrown after the finally block executes. This way, if the rollback fails, its
error can be logged or handled without losing the context of the original
negotiation error that triggered the rollback attempt.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fb45dd31-7db5-4bc4-92ed-4ef3a77065b9

📥 Commits

Reviewing files that changed from the base of the PR and between fba9660 and 1d435f1.

📒 Files selected for processing (2)

packages/client/src/rtc/Subscriber.ts
packages/client/src/rtc/__tests__/Subscriber.test.ts

Reverts the organic ICE path migration detection added earlier on this branch: the onIceDisconnected/shouldKeepScheduledIceRestart hooks in BasePeerConnection and the candidate-pair snapshot/compare logic in Publisher. When ICE returns to connected after a disconnect, the scheduled restartICE is once again simply canceled, as on main. Also removes the now-obsolete candidate-pair migration tests and the ICE-transport mock helpers that only those tests used. BasePeerConnection, Publisher, and webrtc.mocks are now identical to main. The signal-close revival hardening in StreamSfuClient and Call is unrelated and stays.

…ver bandwidth Add CandidatePairMonitor, which watches the publisher's BUNDLE-shared ICE transport for selected-candidate-pair changes (e.g. WiFi to LTE) and restarts ICE so libwebrtc's network-route change re-seeds the send-side bandwidth estimate. Without this the estimate craters on a handover and ramps back over ~30s, starving the encoder and collapsing FPS. Validated via webrtc-internals: recovery in ~2-3s instead of ~30s. A change restarts ICE only when it moves to a new interface (keyed by network-id) while ICE is healthy; same-interface churn, unhealthy-ICE changes, and returns to a recently-used path (flap) are ignored. On React Native this consumes the selectedcandidatepairchange event from the GetStream react-native-webrtc fork and is a safe no-op where the API is unavailable.

On a failed subscriber negotiation, escalate via tryRestartIce so a failed ICE restart falls through to a reconnect instead of being silently logged in the dispatcher's catch. Also guard the renegotiation rollback so a rollback failure can no longer mask the original negotiation error; the original error always propagates now. Adds a regression test for the rollback-failure case.

Reverts the CandidatePairMonitor introduced in aed0593. Detecting an organic selected-candidate-pair migration and restarting ICE did not work well in practice: the restart churn and disconnect cascades on flaky handovers outweighed the bandwidth-recovery benefit.

Restore the bf837b1 behavior: on a fast reconnect, restart the publisher's ICE whenever it is publishing instead of skipping it when the publisher PC looks stable. The stability-gated skip could strand a crateted send-side bandwidth estimate, since a 'stable' publisher PC never restarts ICE and the route-change BWE reset never fires. Removes the now-unused isStable() helper and its tests.

The unhealthy-close timer fired at exactly two ping intervals (10s) after the last inbound message, coinciding with the second keep-alive ping. That left only the first probe's response a window to arrive, so a single lost or delayed healthCheckResponse could close a live signal socket and force an unnecessary reconnect. Extend the inactivity timeout to two ping intervals plus a 2s grace period (12s) so the second probe's response has time to land. It now takes two missed responses to trip the watchdog.

oliverlaz added 2 commits June 15, 2026 16:45

Merge branch 'refs/heads/main' into feat/ice-candidate-pair-monitor

bad047f

oliverlaz changed the title ~~feat(client): detect organic ICE path migrations and restart ICE~~ feat(client): restart ICE on organic path migration and harden signal-close revival Jun 16, 2026

coderabbitai Bot reviewed Jun 16, 2026

View reviewed changes

oliverlaz changed the title ~~feat(client): restart ICE on organic path migration and harden signal-close revival~~ fix(client): harden SFU signal-close revival and subscriber ICE-restart recovery Jun 16, 2026

oliverlaz added 4 commits June 18, 2026 16:18

oliverlaz changed the title ~~fix(client): harden SFU signal-close revival and subscriber ICE-restart recovery~~ fix(client): harden SFU reconnection and ICE-restart recovery Jun 22, 2026

oliverlaz requested review from greenfrvr, jdimovska and santhoshvai June 22, 2026 12:08

oliverlaz added 3 commits June 22, 2026 14:14

chore: mark the connection as unhealthy after missing 2 ping probes

29f4c25

Merge branch 'main' into feat/ice-candidate-pair-monitor

c95f723

jdimovska approved these changes Jun 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(client): harden SFU reconnection and ICE-restart recovery#2285

fix(client): harden SFU reconnection and ICE-restart recovery#2285
oliverlaz wants to merge 12 commits into
mainfrom
feat/ice-candidate-pair-monitor

oliverlaz commented Jun 15, 2026 •

edited

Loading

Uh oh!

changeset-bot Bot commented Jun 15, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

oliverlaz commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

💡 Overview

📝 Implementation notes

Uh oh!

changeset-bot Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ No Changeset found

Uh oh!

coderabbitai Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

oliverlaz commented Jun 15, 2026 •

edited

Loading

changeset-bot Bot commented Jun 15, 2026 •

edited

Loading

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading