@KyleAMathews
Contributor

@KyleAMathews KyleAMathews commented Nov 13, 2025

Investigated Discord bug report where collections stop reconnecting
after rapid tab switching in Firefox.

Root causes identified:
1. PRIMARY: Race condition in pause/resume state machine - resume()
   only checks for 'paused' state, but pause() sets 'pause-requested'
   as intermediate state. Rapid visibility changes cause stream to
   get stuck.

2. SECONDARY: Memory leak - visibility change listener is never
   removed, causing stale handlers and memory accumulation.

3. CONTRIBUTING: SSE fallback logic can be triggered unintentionally
   during rapid tab switching.

Firefox is more affected due to aggressive request abortion
(NS_BINDING_ABORTED) and timing differences in visibility API.

Includes recommended fixes for all identified issues.

Additional documentation files created during investigation:
- VISIBILITY_BUG_ANALYSIS.md: Deep dive with state diagrams
- VISIBILITY_BUG_SUMMARY.md: Concise summary with code snippets
- VISIBILITY_BUG_QUICK_REFERENCE.md: Quick lookup guide

Fixes race condition where rapid visibility changes cause streams to
get stuck and stop reconnecting, particularly in Firefox.

Primary fix - Race condition in pause/resume state machine:
- Modified #resume() to handle 'pause-requested' state in addition to
  'paused' state
- When resuming from 'pause-requested', set state back to 'active' to
  prevent the pause from completing
- This fixes the bug where rapid tab switching causes #resume() to be
  called before the abort completes, leaving the stream stuck
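
The primary fix above can be illustrated with a small model of the state machine. This is a minimal sketch, not the code in client.ts: the class name `StreamStateSketch`, the `restarts` counter, and the `resumeBeforeFix`/`resumeAfterFix` methods are hypothetical names introduced here, and only the three state names come from the description.

```typescript
// Hypothetical model of the pause/resume states named in this PR.
type StreamState = 'active' | 'pause-requested' | 'paused'

class StreamStateSketch {
  state: StreamState = 'active'
  restarts = 0 // counts how often a resume actually restarted the stream

  // Tab hidden: request a pause. The real client would also abort the
  // in-flight request here; 'paused' is only reached once that abort settles.
  pause(): void {
    if (this.state === 'active') this.state = 'pause-requested'
  }

  // Behaviour described for the old #resume(): only 'paused' is handled.
  resumeBeforeFix(): boolean {
    if (this.state !== 'paused') return false // 'pause-requested' is ignored
    this.state = 'active'
    this.restarts++
    return true
  }

  // Behaviour described for the fixed #resume(): 'pause-requested' is also
  // handled, and the state goes straight back to 'active'.
  resumeAfterFix(): boolean {
    if (this.state !== 'paused' && this.state !== 'pause-requested') return false
    this.state = 'active'
    this.restarts++
    return true
  }
}
```

Calling `pause()` and then `resumeBeforeFix()` leaves the model in 'pause-requested', which mirrors the stuck stream; the same sequence with `resumeAfterFix()` returns it to 'active'.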

Secondary fix - Memory leak in visibility listener:
- Added #unsubscribeFromVisibilityChanges field to store cleanup function
- Modified #subscribeToVisibilityChanges() to store removeEventListener
  callback
- Modified unsubscribeAll() to call cleanup function
- Prevents memory leaks and stale event handlers
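
The listener cleanup described above might look roughly like the following. It is a sketch under assumptions: `VisibilityTarget` and `VisibilitySubscriptionSketch` are hypothetical names, the TypeScript `private` keyword stands in for the `#` private fields used in the source, and the target is passed in so the example runs outside a browser (in a browser, `document` could be passed instead).

```typescript
// Hypothetical minimal view of an event target such as `document`.
interface VisibilityTarget {
  addEventListener(type: string, listener: () => void): void
  removeEventListener(type: string, listener: () => void): void
}

class VisibilitySubscriptionSketch {
  // Stores the cleanup function, mirroring #unsubscribeFromVisibilityChanges.
  private unsubscribeFromVisibilityChanges?: () => void

  constructor(
    private readonly target: VisibilityTarget,
    private readonly onChange: () => void
  ) {}

  subscribe(): void {
    // Avoid stacking a second handler if we are already subscribed.
    if (this.unsubscribeFromVisibilityChanges) return
    const handler = (): void => this.onChange()
    this.target.addEventListener('visibilitychange', handler)
    // Keep a closure over the exact handler so it can be removed later.
    this.unsubscribeFromVisibilityChanges = () => {
      this.target.removeEventListener('visibilitychange', handler)
    }
  }

  unsubscribeAll(): void {
    // Remove the listener (if any) and drop the reference to it.
    this.unsubscribeFromVisibilityChanges?.()
    this.unsubscribeFromVisibilityChanges = undefined
  }
}
```

Storing the removal closure rather than the handler keeps `unsubscribeAll()` independent of how the handler was registered.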

Root cause: When tab visibility changes rapidly (especially in Firefox):
1. Tab hidden → #pause() sets state to 'pause-requested'
2. Request is aborted (takes time to complete)
3. Tab visible again → #resume() called before state becomes 'paused'
4. Old #resume() only checked for 'paused', so it did nothing
5. Stream remained stuck in 'pause-requested' state

Firefox is more affected due to aggressive request abortion with
NS_BINDING_ABORTED and different timing in visibility API implementation.

Fixes: Collections getting stuck after rapid tab switching
Related: Discord bug report about collections not reconnecting

- Add changeset for collection stuck bug fix
- Remove investigation documentation files for cleaner PR
@codecov

codecov bot commented Nov 13, 2025

Codecov Report

❌ Patch coverage is 72.72727% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.39%. Comparing base (c4d0ea4) to head (238008f).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| packages/typescript-client/src/client.ts | 72.72% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3431      +/-   ##
==========================================
- Coverage   69.54%   69.39%   -0.15%     
==========================================
  Files         182      182              
  Lines        9778     9787       +9     
  Branches      353      360       +7     
==========================================
- Hits         6800     6792       -8     
- Misses       2976     2993      +17     
  Partials        2        2              
| Flag | Coverage Δ |
|---|---|
| elixir | 66.34% <ø> (-0.18%) ⬇️ |
| elixir-client | 74.47% <ø> (ø) |
| packages/experimental | 87.73% <ø> (ø) |
| packages/react-hooks | 86.48% <ø> (ø) |
| packages/typescript-client | 93.67% <72.72%> (-0.15%) ⬇️ |
| packages/y-electric | 55.12% <ø> (ø) |
| postgres-140000 | 65.36% <ø> (-0.12%) ⬇️ |
| postgres-150000 | 65.29% <ø> (-0.14%) ⬇️ |
| postgres-170000 | 65.29% <ø> (-0.33%) ⬇️ |
| postgres-180000 | 65.34% <ø> (?) |
| sync-service | 65.53% <ø> (-0.20%) ⬇️ |
| typescript | 87.13% <72.72%> (-0.06%) ⬇️ |
| unit-tests | 69.39% <72.72%> (-0.15%) ⬇️ |

Fixes an additional race condition where a stale abort completion
could overwrite the state after resume() has been called.

Timeline of the bug:
1. Request #1 running with AbortController #1
2. Tab hidden → pause() sets state to 'pause-requested', aborts #1
3. Tab visible → resume() sets state to 'active', starts request #2
4. Old request #1's abort completes, sets state to 'paused'
5. Stream stuck because state is 'paused' but should be 'active'

Fix: Only transition to 'paused' if state is still 'pause-requested'.
If resume() already changed it to 'active', don't overwrite it.

This ensures that old abort completions don't interfere with the
new active request started by resume().
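
The timeline and guard above can be replayed with a small synchronous model. This is a sketch under assumptions: `AbortGuardSketch`, `settleAbortUnguarded`, `settleAbortGuarded`, and `requestId` are hypothetical names, and the abort completion is modelled as a plain method call rather than a real aborted request.

```typescript
type PauseState = 'active' | 'pause-requested' | 'paused'

class AbortGuardSketch {
  state: PauseState = 'active'
  requestId = 1 // stands in for "request #1", "request #2", ...

  // Step 2 of the timeline: tab hidden, pause requested, request aborted.
  pause(): void {
    if (this.state === 'active') this.state = 'pause-requested'
  }

  // Step 3: tab visible again, a new request is started.
  resume(): void {
    if (this.state === 'paused' || this.state === 'pause-requested') {
      this.state = 'active'
      this.requestId++
    }
  }

  // Step 4 without the guard: the stale completion overwrites the state.
  settleAbortUnguarded(): void {
    this.state = 'paused'
  }

  // Step 4 with the guard: only finish the pause if it is still pending.
  settleAbortGuarded(): void {
    if (this.state === 'pause-requested') this.state = 'paused'
  }
}
```

Running pause, resume, then the unguarded completion ends in 'paused' even though request #2 is live; with the guarded completion the state stays 'active', and a pause that is never interrupted still reaches 'paused'.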

- Add type assertion to handle concurrent state changes during async
  operations that TypeScript's flow analysis cannot detect
- Update changeset with detailed state machine explanation and both
  race condition fixes
- Add explanatory comments about why type assertion is necessary

TypeScript's control flow analysis sees state as 'active' after line
631, but doesn't account for the fact that the visibility change
handler can call #pause() during the await, changing state to
'pause-requested'. The type assertion tells TypeScript we're
intentionally checking the runtime value.
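
The narrowing problem described above can be reproduced in isolation. This is a minimal sketch, assuming a hypothetical `NarrowingSketch` class and `run` method; it is not the code around line 631 of client.ts, only the same pattern.

```typescript
type FlowState = 'active' | 'pause-requested' | 'paused'

class NarrowingSketch {
  state: FlowState = 'paused'

  // Stands in for the visibility handler calling #pause().
  pause(): void {
    if (this.state === 'active') this.state = 'pause-requested'
  }

  async run(work: () => Promise<void>): Promise<FlowState> {
    this.state = 'active' // TypeScript now narrows this.state to 'active'
    await work() // another handler may call pause() while we wait

    // Without the assertion, the compiler would typically reject comparing
    // the narrowed type 'active' with 'pause-requested' as having no overlap.
    // Widening back to FlowState makes the runtime check explicit.
    if ((this.state as FlowState) === 'pause-requested') {
      this.state = 'paused'
    }
    return this.state
  }
}
```

If `work` calls `pause()` before it resolves, `run` resolves to 'paused'; otherwise it resolves to 'active'.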
@kevin-dp
Contributor

This fix looks good to me. It handles the case where we quickly resume the stream before the pause has fully finished. The old pause shouldn't interfere with the quick resume, thanks to the extra check added to the if condition. It would be nice to have preview packages in Electric so that the OP can test the fix and confirm it works before we actually merge it.

@KyleAMathews KyleAMathews merged commit b377010 into main Nov 13, 2025
44 checks passed
@KyleAMathews KyleAMathews deleted the claude/investigate-collection-stuck-bug-011CV6HmgS6rMVU2QCTMVKys branch November 13, 2025 19:12