Skip to content

Conversation

@thomasdesr
Copy link

Summary

  • This PR adds (& enables by default) automatic retry for HTTP 429 (Too Many Requests) the bk api commands
  • Added --verbose flag to observe retry behavior when using the CLI

Note: I didn't want to make this the default anywhere other than the CLI because I don't know what the consequences would be of making this the default 🙃 If you think its safe, happy to make it the default everywhere 🤣

Implementation Details

  • Built following the docs here: https://buildkite.com/docs/apis/rest-api/limits
  • When the client experiences a 429 it won't retry before RateLimit-Reset timer has expired
  • If after that, it still hits another 429, it starts to exponentially backoff on repeated rate limits
  • internal/http gained a few additional options to configure the behavior:
    • WithMaxRetries to control how many attempts & WithMaxRetryDelay() to set an upper bound on retry intervals
    • WithOnRetry callback (currently used by the --verbose flag for printing warnings about backoff being hit)

The HTTP client now automatically retries when the Buildkite API returns
429 responses. On the first retry, it waits exactly the duration specified
in the RateLimit-Reset header. On subsequent retries, it multiplies the
delay exponentially (2x, 4x, etc.) to reduce contention when multiple
clients are competing for quota.

Configurable via WithMaxRetries (default 3) and WithMaxRetryDelay (default 60s).
Adds WithOnRetry(func(attempt int, delay time.Duration)) option to the
HTTP client. The callback is invoked before each retry sleep, allowing
callers to log retry attempts, collect metrics, or implement custom
notification logic.
When --verbose is enabled, the api command logs rate limit retries to
stderr with the delay duration and expected resume time.
Remove default retry settings from the HTTP client to avoid changing
behavior for other commands until we discuss with the team. The bk api
command explicitly opts in to retry behavior.
Previously, min(delay, 0) would clamp delays to 0 when WithMaxRetryDelay
wasn't called. Now only clamp when a positive max is configured.
@thomasdesr thomasdesr requested a review from a team as a code owner December 10, 2025 04:04
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@JoeColeman95 JoeColeman95 enabled auto-merge (squash) December 10, 2025 14:57
JoeColeman95
JoeColeman95 previously approved these changes Dec 10, 2025
@scadu
Copy link
Contributor

scadu commented Dec 10, 2025

Hey @thomasdesr 👋
Thanks for the contribution!
While automatic retry for rate limit handling is usually great idea, I think in context of CLI, we might want to emit an error message stating that rate limit has been reached, and stop the current execution.
RateLimit-Reset could be included in the message to communicate clearly when the limit is reset.

Would you be able to update the PR to make CLI behave that way?

@thomasdesr
Copy link
Author

So that's actually exactly how it behaves today :D When you hit the rate limit the CLI exits non-zero and returns an error that contains the notice that you've hit a back off.

The reason I wanted to add support (either default or opt-in) for respecting and retrying the rate limit within the CLI was that when scripting multiple calls, getting rate limit errors in the middle of a pipeline was quite difficult to handle correctly.

E.g. I was writing something like this to batch download logs for a build:

bk build view --pipeline "$ORG/$PIPELINE" "$build_id" \
  | jq -r '. as $b
      | $b.jobs[]
      | select(.type=="script")
      | [$b.pipeline.slug, $b.number, .id]
      | @tsv' \
  | parallel -j8 --colsep '\t' -u \
      'bk api --verbose "/pipelines/{1}/builds/{2}/jobs/{3}/log" | jq -r .content > {1}/{2}-{3}.log'

I can use something like parallel's built-in retries, but then I'm not actually going to respect the rate limit backoff instruction. To do that I'd need to wrap the bk cli and extract the backoff time from the error message? At that point I might as well use the SDK directly xD

Would love advice if I'm going about this wrong :D

@scadu
Copy link
Contributor

scadu commented Dec 12, 2025

@thomasdesr, after revisiting it, I agree that handling 429s for api commands would be welcome addition. It looks like we currently only handle it for build|job GETs through go-buildkite.

If you can continue working on this pull request, I'd recommend using https://github.com/buildkite/roko to handle retries – it's a library already used by our agent, agent-stack-k8s and others.

Again, thanks a ton for your contribution! 🙇

Use Buildkite's roko library for retry handling instead of the custom
retry loop. This aligns with patterns used in other Buildkite projects
(agent, agent-stack-k8s).
@thomasdesr thomasdesr force-pushed the thomas/add-automatic-429-retry branch from 603731b to 7484bcf Compare December 27, 2025 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants