-
Notifications
You must be signed in to change notification settings - Fork 593
feat: prevent retries when terminal error happens #4258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
This stack of pull requests is managed by Graphite. Learn more about stacking. |
📝 WalkthroughWalkthroughIntroduce centralized terminal-vs-retriable build error classification and wrapping, add retry-wrapped DB ops, and refactor BuildKit status handling to buffer, sort, and emit chronological events via a larger status channel; update deploy workflow to treat terminal build errors as permanent failures. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant BuildSvc as Build Service
participant StatusBuf as Status Buffer
participant DB
participant ErrorWrap as wrapBuildError
participant Restate
Client->>BuildSvc: Request CreateBuild
BuildSvc->>DB: WithRetryContext(Get/Update depot project)
DB-->>BuildSvc: OK / Err
BuildSvc->>BuildSvc: Start Solve / connect BuildKit
BuildSvc->>StatusBuf: Send status events (buffered, channel size = 1000)
StatusBuf->>StatusBuf: Accumulate events
StatusBuf->>StatusBuf: Sort events by timestamp
StatusBuf->>BuildSvc: Emit ordered vertex completions & logs
alt Any failure encountered
BuildSvc->>ErrorWrap: wrapBuildError(err, code, msg)
alt ErrorWrap returns TerminalError
ErrorWrap->>Restate: TerminalError (no retry)
Restate->>Client: Permanent failure
else Non-terminal
ErrorWrap->>Client: Connect error (retryable)
Client->>Restate: Retry via restate
end
else Build completes successfully
BuildSvc->>Client: Success
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Thank you for following the naming conventions for pull request titles! 🙏 |
1b5b317 to
5476756
Compare
61c5223 to
190e6db
Compare
5476756 to
9639b01
Compare
190e6db to
32da06d
Compare
32da06d to
6170835
Compare
9639b01 to
a310e14
Compare
8750ee2 to
4046eef
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
go/apps/ctrl/services/build/backend/depot/create_build.go(7 hunks)go/apps/ctrl/services/build/backend/depot/errors.go(1 hunks)go/apps/ctrl/workflows/deploy/deploy_handler.go(1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: ogzhanolguncu
Repo: unkeyed/unkey PR: 3564
File: go/cmd/cli/commands/deploy/deploy.go:71-73
Timestamp: 2025-07-15T15:11:30.840Z
Learning: In the go/cmd/cli/commands/deploy/ CLI codebase, ogzhanolguncu uses a UI-layer error handling pattern where errors are displayed through ui.PrintError() and ui.PrintErrorDetails(), followed by returning nil to indicate the error has been handled at the UI level, rather than propagating errors up the call stack.
Learnt from: ogzhanolguncu
Repo: unkeyed/unkey PR: 3564
File: go/cmd/cli/commands/deploy/deploy.go:153-158
Timestamp: 2025-07-16T09:18:45.379Z
Learning: In the go/cmd/cli/commands/deploy/ CLI codebase, ogzhanolguncu prefers to allow deployment to continue even when Docker push fails (around lines 153-158 in deploy.go) because the team is working locally and needs this behavior for local development workflows where registry access might not be available.
Learnt from: ogzhanolguncu
Repo: unkeyed/unkey PR: 3297
File: apps/dashboard/lib/trpc/routers/authorization/roles/query.ts:210-323
Timestamp: 2025-06-04T20:13:12.060Z
Learning: The user ogzhanolguncu prefers explicit, duplicated code over abstracted helper functions when it improves readability, even if it means some duplication in filter building functions in the authorization roles query module.
Learnt from: ogzhanolguncu
Repo: unkeyed/unkey PR: 2825
File: apps/dashboard/app/(app)/logs-v2/hooks/use-bookmarked-filters.ts:0-0
Timestamp: 2025-01-30T20:51:44.359Z
Learning: The user (ogzhanolguncu) prefers to handle refactoring suggestions in separate PRs to maintain focus in the current PR.
📚 Learning: 2025-07-16T09:18:45.379Z
Learnt from: ogzhanolguncu
Repo: unkeyed/unkey PR: 3564
File: go/cmd/cli/commands/deploy/deploy.go:153-158
Timestamp: 2025-07-16T09:18:45.379Z
Learning: In the go/cmd/cli/commands/deploy/ CLI codebase, ogzhanolguncu prefers to allow deployment to continue even when Docker push fails (around lines 153-158 in deploy.go) because the team is working locally and needs this behavior for local development workflows where registry access might not be available.
Applied to files:
go/apps/ctrl/workflows/deploy/deploy_handler.go
📚 Learning: 2025-08-21T15:54:45.198Z
Learnt from: chronark
Repo: unkeyed/unkey PR: 3825
File: go/internal/services/usagelimiter/limit.go:38-0
Timestamp: 2025-08-21T15:54:45.198Z
Learning: In go/internal/services/usagelimiter/limit.go, the UpdateKeyCreditsDecrement operation cannot be safely wrapped with db.WithRetry due to the lack of idempotency mechanisms in the current tech stack. Retrying this non-idempotent write operation risks double-charging users if the first attempt commits but the client sees a transient error.
Applied to files:
go/apps/ctrl/services/build/backend/depot/create_build.go
🧬 Code graph analysis (1)
go/apps/ctrl/services/build/backend/depot/create_build.go (3)
go/pkg/db/retry.go (1)
WithRetryContext(36-46)go/pkg/db/project_find_by_id.sql_generated.go (1)
FindProjectByIdRow(31-44)go/pkg/db/project_update_depot_id.sql_generated.go (1)
UpdateProjectDepotIDParams(21-25)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Go API Local / Test
- GitHub Check: Test API / API Test Local
- GitHub Check: Build / Build
- GitHub Check: detect_changes / Detect Changes
- GitHub Check: autofix
🔇 Additional comments (10)
go/apps/ctrl/services/build/backend/depot/errors.go (2)
24-41: LGTM! Solid terminal error detection logic.The case-insensitive pattern matching approach is appropriate for error classification, and the nil check prevents panics.
43-51: LGTM! Error wrapping logic correctly distinguishes terminal vs. retriable errors.The function properly wraps errors and returns the appropriate type (Restate TerminalError or Connect error) based on the terminal status.
go/apps/ctrl/workflows/deploy/deploy_handler.go (1)
146-157: LGTM! Terminal error handling prevents unnecessary retries.The implementation correctly distinguishes between terminal and non-terminal build errors, preventing retry pollution in logs for permanent failures (broken Dockerfile, missing files, etc.) while maintaining retry behavior for transient issues. The detailed logging helps with debugging.
go/apps/ctrl/services/build/backend/depot/create_build.go (7)
104-104: LGTM! Consistent error wrapping.The switch to
wrapBuildErrorenables proper terminal error detection for build creation failures.
128-128: LGTM! Consistent error wrapping.The switch to
wrapBuildErrorenables proper terminal error detection for machine acquisition failures.
145-145: LGTM! Consistent error wrapping.The switch to
wrapBuildErrorenables proper terminal error detection for BuildKit connection failures.
176-176: LGTM! Consistent error wrapping.The switch to
wrapBuildErrorenables proper terminal error detection for build execution failures, which is the primary goal of this PR.
233-235: LGTM! Safe retry wrapper for read operation.Wrapping the
FindProjectByIdquery withWithRetryContextadds resilience for transient database errors. Read operations are naturally idempotent and safe to retry.
164-164: Unable to verify buffer size reduction claim; request manual verification.The review comment claims the buffer was reduced from 100 to 10, but I found no evidence of this in the codebase. The current code shows a buffer size of 10, and git history is unavailable to confirm the prior size.
While the concern about channel buffer capacity is theoretically valid—a small buffer can cause blocking if the producer (buildkitClient.Solve) generates updates faster than the consumer (processBuildStatus) can process them—I cannot determine from the codebase alone whether:
- The buffer size was actually 100 previously
- The typical rate of status updates exceeds the consumer's throughput
- 10 is insufficient for typical workloads
The processBuildStatus function appears to perform straightforward processing (iterating through logs and marking completed vertices), suggesting low latency per status update, but without metrics or load testing data, the adequacy of the buffer size cannot be definitively confirmed.
276-286: No issues found. The operation is safe as implemented.The
UpdateProjectDepotIDoperation is confirmed idempotent:
- The Depot project is already created via external API before the UPDATE (line 269)
- The UPDATE is a fixed-value SET operation, not an increment/decrement
- Retrying with the same arguments produces the same database state
- Unlike the learnings case (credits decrement), this has no quantitative or financial side effects
The retry wrapper is appropriate here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
go/apps/ctrl/services/build/backend/depot/create_build.go(10 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: imeyer
Repo: unkeyed/unkey PR: 3755
File: .github/actions/setup-node/action.yaml:0-0
Timestamp: 2025-08-08T15:10:46.436Z
Learning: Repo: unkeyed/unkey — Preference: If imeyer comments “issue” on a PR thread, automatically open a thorough GitHub issue (Summary, Impact, Where, Observed vs Expected, Acceptance Criteria, Validation Plan, Out of Scope, References), include backlinks to the PR and comment, and assign to imeyer.
Learnt from: imeyer
Repo: unkeyed/unkey PR: 3755
File: .github/workflows/runbook-freshness-check.yaml:157-173
Timestamp: 2025-08-08T14:59:52.283Z
Learning: Repo unkeyed/unkey: When a CI/workflow fix is deferred, imeyer prefers a thorough GitHub issue be opened with sections (Summary, Impact, Where, Repro, Observed vs Expected, Acceptance Criteria, Validation Plan, Out of Scope, References) and assigned to imeyer, including backlinks to the originating PR and comment.
Learnt from: imeyer
Repo: unkeyed/unkey PR: 3755
File: .github/workflows/job_detect_changes.yaml:111-114
Timestamp: 2025-08-08T16:10:00.224Z
Learning: Repo: unkeyed/unkey — When imeyer writes “make me an issue” on a PR thread, automatically open a thorough GitHub issue (Summary, Impact, Where, Repro, Observed vs Expected, Acceptance Criteria, Validation Plan, Out of Scope, References), include backlinks to the PR and comment, and assign to imeyer.
Learnt from: ogzhanolguncu
Repo: unkeyed/unkey PR: 3297
File: apps/dashboard/lib/trpc/routers/authorization/roles/query.ts:210-323
Timestamp: 2025-06-04T20:13:12.060Z
Learning: The user ogzhanolguncu prefers explicit, duplicated code over abstracted helper functions when it improves readability, even if it means some duplication in filter building functions in the authorization roles query module.
Learnt from: ogzhanolguncu
Repo: unkeyed/unkey PR: 2825
File: apps/dashboard/app/(app)/logs-v2/hooks/use-bookmarked-filters.ts:0-0
Timestamp: 2025-01-30T20:51:44.359Z
Learning: The user (ogzhanolguncu) prefers to handle refactoring suggestions in separate PRs to maintain focus in the current PR.
📚 Learning: 2025-08-21T15:54:45.198Z
Learnt from: chronark
Repo: unkeyed/unkey PR: 3825
File: go/internal/services/usagelimiter/limit.go:38-0
Timestamp: 2025-08-21T15:54:45.198Z
Learning: In go/internal/services/usagelimiter/limit.go, the UpdateKeyCreditsDecrement operation cannot be safely wrapped with db.WithRetry due to the lack of idempotency mechanisms in the current tech stack. Retrying this non-idempotent write operation risks double-charging users if the first attempt commits but the client sees a transient error.
Applied to files:
go/apps/ctrl/services/build/backend/depot/create_build.go
🧬 Code graph analysis (1)
go/apps/ctrl/services/build/backend/depot/create_build.go (5)
go/pkg/db/retry.go (1)
WithRetryContext(36-46)go/pkg/db/project_find_by_id.sql_generated.go (1)
FindProjectByIdRow(31-44)go/pkg/db/project_update_depot_id.sql_generated.go (1)
UpdateProjectDepotIDParams(21-25)go/apps/ctrl/services/build/backend/depot/service.go (1)
Depot(28-38)go/pkg/clickhouse/schema/types.go (2)
BuildStepV1(101-112)BuildStepLogV1(117-124)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Test Go API Local / Test
- GitHub Check: Test API / API Test Local
- GitHub Check: Build / Build
🔇 Additional comments (6)
go/apps/ctrl/services/build/backend/depot/create_build.go (6)
36-38: Good addition with clear documentation.The increased buffer size and explanatory comment help prevent log loss for large Dockerfiles. The constant is well-documented.
109-109: Consistent error wrapping improves error handling.The centralized error wrapping via
wrapBuildErrorcorrectly classifies terminal vs. retriable build errors across all failure paths, aligning with the PR objective to prevent retries on terminal errors.Also applies to: 133-133, 150-150, 181-181
238-240: Appropriate use of retry wrapper for read operation.Wrapping the database query with
WithRetryContextis safe and correct, as read operations have no side effects and can be retried on transient failures.
281-291: Retry wrapper is safe here due to idempotent update.The retry wrapper is appropriate for this update operation because both
depot_project_idandupdated_atare computed before entering the retry block (line 280), making the operation idempotent. Multiple retries will set the same values without side effects.Based on learnings
304-308: Clean data structure for event ordering.The
eventstruct provides a clear abstraction for collecting and sorting BuildKit events chronologically, supporting the refactored log processing logic.
310-397: Well-structured refactor for chronological event processing.The refactored implementation correctly:
- Buffers all BuildKit events to handle out-of-order delivery
- Sorts events chronologically using stable sort for deterministic ordering when timestamps collide
- Prevents duplicate vertex processing via the
completedmap- Handles edge cases like cached steps with instant completion
The approach of buffering before processing addresses the core issue of BuildKit's asynchronous event delivery.
Note: Error handling for ClickHouse buffering operations (
BufferBuildStep,BufferBuildStepLog) is not present, but given this runs in a goroutine with no error propagation mechanism, this is a broader architectural consideration beyond this PR's scope.
|
@Flo4604 I tried to add our |
fine with me |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
go/apps/ctrl/services/build/backend/depot/create_build.go (1)
168-169: Goroutine synchronization issue remains unaddressed.As previously noted, the goroutine processing build status logs runs asynchronously without synchronization.
CreateBuildmay return beforeprocessBuildStatusLogscompletes, potentially resulting in incomplete logs in ClickHouse.
🧹 Nitpick comments (1)
go/apps/ctrl/services/build/backend/depot/create_build.go (1)
34-37: Clarify the buffer drop behavior and sizing rationale.The comment states logs will be "dropped" if the buffer is too low. In standard Go channels, a full buffered channel blocks the sender rather than dropping messages. If BuildKit has specific drop behavior, this should be documented. Additionally, provide rationale for choosing 1000 as the buffer size (e.g., based on empirical testing with large Dockerfiles).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
go/apps/ctrl/services/build/backend/depot/create_build.go(9 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: imeyer
Repo: unkeyed/unkey PR: 3755
File: .github/workflows/runbook-freshness-check.yaml:157-173
Timestamp: 2025-08-08T14:59:52.283Z
Learning: Repo unkeyed/unkey: When a CI/workflow fix is deferred, imeyer prefers a thorough GitHub issue be opened with sections (Summary, Impact, Where, Repro, Observed vs Expected, Acceptance Criteria, Validation Plan, Out of Scope, References) and assigned to imeyer, including backlinks to the originating PR and comment.
Learnt from: imeyer
Repo: unkeyed/unkey PR: 3755
File: .github/actions/setup-node/action.yaml:0-0
Timestamp: 2025-08-08T15:10:46.436Z
Learning: Repo: unkeyed/unkey — Preference: If imeyer comments “issue” on a PR thread, automatically open a thorough GitHub issue (Summary, Impact, Where, Observed vs Expected, Acceptance Criteria, Validation Plan, Out of Scope, References), include backlinks to the PR and comment, and assign to imeyer.
Learnt from: imeyer
Repo: unkeyed/unkey PR: 3755
File: .github/workflows/job_detect_changes.yaml:111-114
Timestamp: 2025-08-08T16:10:00.224Z
Learning: Repo: unkeyed/unkey — When imeyer writes “make me an issue” on a PR thread, automatically open a thorough GitHub issue (Summary, Impact, Where, Repro, Observed vs Expected, Acceptance Criteria, Validation Plan, Out of Scope, References), include backlinks to the PR and comment, and assign to imeyer.
Learnt from: imeyer
Repo: unkeyed/unkey PR: 3755
File: .github/actions/setup-node/action.yaml:0-0
Timestamp: 2025-08-08T15:10:46.436Z
Learning: For repo unkeyed/unkey and PR review workflows: When imeyer comments "issue" on a thread, automatically create a thorough GitHub issue (sections: Summary, Impact, Where, Repro/Observed vs Expected, Acceptance Criteria, Validation Plan, Out of Scope, References), include backlinks to the PR and the specific comment, and assign the issue to imeyer.
Learnt from: ogzhanolguncu
Repo: unkeyed/unkey PR: 3297
File: apps/dashboard/lib/trpc/routers/authorization/roles/query.ts:210-323
Timestamp: 2025-06-04T20:13:12.060Z
Learning: The user ogzhanolguncu prefers explicit, duplicated code over abstracted helper functions when it improves readability, even if it means some duplication in filter building functions in the authorization roles query module.
Learnt from: ogzhanolguncu
Repo: unkeyed/unkey PR: 2825
File: apps/dashboard/app/(app)/logs-v2/hooks/use-bookmarked-filters.ts:0-0
Timestamp: 2025-01-30T20:51:44.359Z
Learning: The user (ogzhanolguncu) prefers to handle refactoring suggestions in separate PRs to maintain focus in the current PR.
📚 Learning: 2025-08-21T15:54:45.198Z
Learnt from: chronark
Repo: unkeyed/unkey PR: 3825
File: go/internal/services/usagelimiter/limit.go:38-0
Timestamp: 2025-08-21T15:54:45.198Z
Learning: In go/internal/services/usagelimiter/limit.go, the UpdateKeyCreditsDecrement operation cannot be safely wrapped with db.WithRetry due to the lack of idempotency mechanisms in the current tech stack. Retrying this non-idempotent write operation risks double-charging users if the first attempt commits but the client sees a transient error.
Applied to files:
go/apps/ctrl/services/build/backend/depot/create_build.go
🧬 Code graph analysis (1)
go/apps/ctrl/services/build/backend/depot/create_build.go (5)
go/pkg/db/retry.go (1)
WithRetryContext(36-46)go/pkg/db/project_find_by_id.sql_generated.go (1)
FindProjectByIdRow(31-44)go/pkg/db/project_update_depot_id.sql_generated.go (1)
UpdateProjectDepotIDParams(21-25)go/apps/ctrl/services/build/backend/depot/service.go (1)
Depot(28-38)go/pkg/clickhouse/schema/types.go (1)
BuildStepLogV1(117-124)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Test Go API Local / Test
- GitHub Check: Test API / API Test Local
- GitHub Check: Build / Build
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (4)
go/apps/ctrl/services/build/backend/depot/create_build.go (4)
237-239: Good addition of retry logic for database query.Wrapping the
FindProjectByIdquery withdb.WithRetryContextimproves resilience against transient database errors. SELECT operations are inherently idempotent and safe to retry.
280-290: Retry logic for UPDATE is safe and correct.The
UpdateProjectDepotIDoperation is wrapped indb.WithRetryContext, which is appropriate here. Unlike non-idempotent operations (e.g., decrement), this UPDATE is idempotent—settingdepot_project_idto the same value multiple times produces the same result. Thestruct{}{}return is necessary to satisfy theWithRetryContext[T any]signature.
303-345: Well-structured refactoring of build status processing.The refactored
processBuildStatusLogscorrectly:
- Tracks vertices with logs to populate the
HasLogsfield- Uses a
completedmap to prevent duplicate vertex processing- Includes defensive nil checks for vertices
- Buffers both logs and vertex completions for efficient batch processing
The logic is sound and the implementation is clear.
108-108: Terminal error classification in wrapBuildError is correctly implemented and comprehensive.The verification confirms that
wrapBuildErrorproperly distinguishes terminal from retriable errors. The implementation usesisTerminalBuildErrorwhich matches against 9 terminal error patterns covering all major scenarios: wrong dockerfile path ("no such file or directory"), broken dockerfile ("failed to compute cache key"), dockerfile parsing errors, authentication failures ("permission denied", "unauthenticated"), depot internal errors, region config issues, and build failures. Terminal errors are wrapped withrestate.TerminalError()to prevent retries, while others useconnect.NewError()to allow retries. The implementation is sound and aligns with Restate's error handling model.

What does this PR do?
This PR:
Type of change
How should this be tested?
This is enough to break depot flow.