@robacourt robacourt commented Dec 18, 2025

This PR is still in draft because I have some reservations about it:

  1. The proposed cause is not fully explained: how would the inner shape's storage go missing in the first place?
  2. The solution runs ShapeDb.reduce over all shapes, potentially many times (up to once per level of subquery nesting), which could be slow when there are many shapes.

However, this PR represents my current best guess as to how a Materializer went missing, causing AutoArc's outage.

Summary

  • Validate dependency handles when restoring shapes from backup
  • Cascade removal to shapes that depend on removed shapes
  • Refactor restore_dependency_handles for clarity

Problem

Shapes with subqueries store handles to their dependency shapes in shape_dependencies_handles. When restoring shapes from storage via load_shapes, this list is rebuilt and validated by restore_dependency_handles, which removes shapes whose dependencies no longer exist.

However, when restoring from backup via load_backup, this validation was not performed. The backup contains serialized Shape structs with their shape_dependencies_handles already populated, but those handles may reference shapes that no longer exist (due to cleanup, schema changes, or prior removal).
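The missing check can be pictured with a minimal sketch. This is not the code in this PR: the module name `DanglingDeps`, the function `dangling/1`, and the plain-map representation of shapes (handle => map with a `shape_dependencies_handles` key) are assumptions made purely for illustration.

```elixir
defmodule DanglingDeps do
  # Hypothetical helper, not part of Electric. `shapes` is assumed to be a map
  # of handle => %{shape_dependencies_handles: [handle]}.
  #
  # Returns a map of handle => dependency handles that are absent from
  # `shapes`. Shapes whose dependencies all exist are omitted.
  def dangling(shapes) when is_map(shapes) do
    for {handle, %{shape_dependencies_handles: deps}} <- shapes,
        missing = Enum.reject(deps, &Map.has_key?(shapes, &1)),
        missing != [],
        into: %{},
        do: {handle, missing}
  end
end

# A restored set in which "parent" still points at a shape that was not restored.
restored = %{
  "parent" => %{shape_dependencies_handles: ["inner"]},
  "standalone" => %{shape_dependencies_handles: []}
}

DanglingDeps.dangling(restored)
# => %{"parent" => ["inner"]}
```

A non-empty result is the condition that load_backup previously did not detect.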

Evidence from AutoArc's production logs

AWS CloudWatch logs from Dec 18 show a crash loop with this error:

GenServer.call({:via, Registry, {..., {Electric.Shapes.Consumer.Materializer, "32220858-1765808264363524"}}}, :get_link_values, 5000)
** (EXIT) no process: the process is not alive

Key observations:

  • Shape 32220858-1765808264363524 was created Dec 15 (timestamp embedded in handle)
  • On Dec 17 at 20:25:28, this shape's Materializer shut down (:shutdown reason)
  • 0.6 seconds later, the system loaded from backup
  • 126 shapes were restored, but 32220858-1765808264363524 was not among them
  • A parent shape still referenced it in shape_dependencies_handles
  • Every restart reloaded the same corrupted backup state, causing the same crash
  • No "Removing shape" logs exist for this handle; it was never explicitly removed, just lost

This is a classic stale reference: the dependency shape was removed/lost, but parent shapes restored from backup still held handles to it.
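The creation date in the first observation can be checked by decoding the handle. The sketch below assumes the part after the dash is a Unix timestamp in microseconds; that unit is an inference from the Dec 15 date, not something confirmed here, and `HandleTimestamp` is a hypothetical helper.

```elixir
defmodule HandleTimestamp do
  # Assumption: a handle has the form "<hash>-<microseconds since the Unix epoch>".
  def created_at(handle) when is_binary(handle) do
    [_hash, micros] = String.split(handle, "-", parts: 2)

    micros
    |> String.to_integer()
    |> DateTime.from_unix!(:microsecond)
  end
end

HandleTimestamp.created_at("32220858-1765808264363524")
# => ~U[2025-12-15 14:17:44.363524Z]
```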

Fix

  1. After load_backup removes shapes without valid storage, call remove_shapes_with_invalid_dependencies to cascade removals to any shapes whose shape_dependencies_handles reference the removed handles
  2. Refactor restore_dependency_handles to also cascade removals (handles the load_shapes path)
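The cascade can be sketched as a repeat-until-stable loop. This is an illustration under assumptions, not the implementation in this PR: `CascadeRemoval.remove_invalid/2` is a hypothetical stand-in for remove_shapes_with_invalid_dependencies, and shapes are modelled as a plain map of handle => map with a `shape_dependencies_handles` key rather than going through ShapeDb.

```elixir
defmodule CascadeRemoval do
  # Hypothetical sketch. Drops every shape that depends, directly or
  # transitively, on a handle that is not present in `shapes`.
  # Returns {kept_shapes, sorted_removed_handles}.
  def remove_invalid(shapes, removed \\ []) when is_map(shapes) do
    # One full pass: collect shapes with at least one missing dependency.
    invalid =
      for {handle, %{shape_dependencies_handles: deps}} <- shapes,
          Enum.any?(deps, &(not Map.has_key?(shapes, &1))),
          do: handle

    case invalid do
      # Nothing left to remove, so the set is stable.
      [] -> {shapes, Enum.sort(removed)}
      # Removing these may invalidate their dependents, so scan again.
      _ -> remove_invalid(Map.drop(shapes, invalid), invalid ++ removed)
    end
  end
end

# "inner" was lost; "middle" depends on it and "outer" depends on "middle".
shapes = %{
  "outer" => %{shape_dependencies_handles: ["middle"]},
  "middle" => %{shape_dependencies_handles: ["inner"]},
  "plain" => %{shape_dependencies_handles: []}
}

{kept, removed} = CascadeRemoval.remove_invalid(shapes)
# Map.keys(kept) => ["plain"]
# removed        => ["middle", "outer"]
```

Each pass scans every remaining shape, and one extra pass is needed per level of nesting. That is the cost raised in the second reservation at the top of this description.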

@robacourt robacourt force-pushed the rob/fix-subquery-restore branch from 0e2a220 to 28da709 Compare December 18, 2025 17:50
codecov bot commented Dec 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.79%. Comparing base (a7b6094) to head (bcd75af).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3628   +/-   ##
=======================================
  Coverage   87.79%   87.79%           
=======================================
  Files          18       18           
  Lines        1663     1663           
  Branches      415      415           
=======================================
  Hits         1460     1460           
  Misses        201      201           
  Partials        2        2           
Flag                          Coverage Δ
packages/experimental         87.73% <ø> (ø)
packages/react-hooks          86.48% <ø> (ø)
packages/typescript-client    93.76% <ø> (ø)
packages/y-electric           56.05% <ø> (ø)
typescript                    87.79% <ø> (ø)
unit-tests                    87.79% <ø> (ø)



blacksmith-sh bot commented Dec 19, 2025

Found 1 test failure on Blacksmith runners:

Failure

Test: at line 56 in shell psql/at line 56 in shell psql

@robacourt robacourt force-pushed the rob/fix-subquery-restore branch from ec42b75 to bcd75af Compare December 19, 2025 10:26