Migration from tiered to diskless design draft #477
base: design/ts-unification
Conversation
giuseppelillo left a comment:
One aspect missing is the possibility of split brain / zombie leaders during the migration.
A zombie leader might choose a different boundary than the one chosen by the real leader. This can be solved through leader epochs though (when committing the boundary to the Control Plane, the request with the higher leader epoch wins).
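For illustration, a minimal sketch (in Java) of the epoch-fenced commit this comment proposes; `BoundaryCommit` and `BoundaryCommitResolver` are hypothetical names, not types from the design. Whether an already-committed B0 may be replaced at all is constrained further later in the design.

```java
// Hypothetical sketch: the control plane resolves competing boundary commits from a
// real leader and a zombie leader by comparing leader epochs; the higher epoch wins.
record BoundaryCommit(long b0, int leaderEpoch) {}

final class BoundaryCommitResolver {
    private BoundaryCommit committed; // null until any boundary has been accepted

    synchronized boolean tryCommit(BoundaryCommit request) {
        if (committed == null || request.leaderEpoch() > committed.leaderEpoch()) {
            committed = request;   // current (or newer) leader wins
            return true;
        }
        return false;              // zombie leader with a stale epoch loses; its boundary is discarded
    }
}
```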
> ## Tiered storage reads on non-leader replicas
What's the difference between this design and the one already accepted upstream?
I re-read the KIP carefully, and so far my understanding is that it describes only follower fetch during bootstrap. Here I'm trying to describe clients fetching tiered offsets from a follower.
Right, it's a completely different thing: the KIP is only about reducing the amount of data a follower needs to fetch from the leader. Follower fetching is already available for Tiered Storage.
> - Diskless writes allocate offsets from control-plane state; if the control plane has not published a boundary (`diskless_log_start_offset=-1`), diskless produce fails fast with `NOT_LEADER_OR_FOLLOWER` (retry).
> - If the boundary cannot be fetched due to transient CP unavailability, diskless produce returns `REQUEST_TIMED_OUT`.
> ### Diagram: Admin-trigger + boundary persistence
I guess this all happens synchronously with respect to the Controller and the calling client?
Mostly async.
The only synchronous part, from the producer's perspective, is that produce requests time out while we are waiting for the HW to catch up with the LEO and persist the selected B0.
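To make that wait concrete, a rough sketch of blocking the migration step until HW == LEO before the selected B0 is persisted; the helper names are hypothetical, not broker code, and a real broker would use purgatory-style callbacks rather than polling.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

// Hypothetical sketch: block the migration step (not the producers themselves) until
// HW == LEO, so every pre-boundary record is replicated before B0 is persisted.
// Producers racing with this step simply see their requests time out and retry.
final class HighWatermarkBarrier {
    static boolean awaitCaughtUp(LongSupplier highWatermark,
                                 LongSupplier logEndOffset,
                                 Duration timeout) throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (highWatermark.getAsLong() < logEndOffset.getAsLong()) {
            if (System.nanoTime() >= deadline) {
                return false; // caller aborts this migration attempt and retries later
            }
            Thread.sleep(10); // polling keeps the sketch simple
        }
        return true;
    }
}
```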
> - `B0` and associated metadata (producer state, epochs) are persisted in the control plane and are **immutable once established**.
> - Setting the same `B0` again is idempotent; attempting to change to a different value is rejected.
> - `B0` is only reset to `UNDEFINED` in exceptional cases where the boundary becomes invalid (e.g., `B0 < logStartOffset` after log truncation), requiring re-migration.
What happens to the diskless.enable topic config in this case?
It stays enabled, because B0 is set per partition and we cannot set `diskless.enable` to false while other partitions could already have diskless data.
What's the way out of this situation? What should the user do?
> | Before migration | Append to UnifiedLog | Should not receive produce |
> | During initialization (B0 not set) | **Block** until B0 persisted | `NOT_LEADER_OR_FOLLOWER` |
> | After B0 set (local) | Route to diskless storage | `NOT_LEADER_OR_FOLLOWER` |
> | After B0 set (replica doesn't know yet) | N/A | `NOT_LEADER_OR_FOLLOWER` (force refresh) |
I don't understand the N/A here; shouldn't it also be "Route to diskless storage", as above?
That's perhaps just poor framing: the row is for the case when B0 is set but the replica does not know about it yet, and the column is for the leader, which can't be unaware because it set B0 itself.
We can say "can't happen" there instead, for example.
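For illustration, the quoted table rendered as dispatch logic; `MigrationPhase`, `Action`, and `ProduceRouter` are hypothetical names for a sketch, not actual broker classes.

```java
// Hypothetical sketch of the produce-path dispatch per migration phase.
enum MigrationPhase { BEFORE_MIGRATION, INITIALIZING_B0_NOT_SET, AFTER_B0_SET }

final class ProduceRouter {
    enum Action {
        APPEND_TO_UNIFIED_LOG,          // pre-migration behavior on the leader
        BLOCK_UNTIL_B0_PERSISTED,       // leader parks the request while B0 is being established
        ROUTE_TO_DISKLESS,              // post-boundary behavior on the leader
        RETURN_NOT_LEADER_OR_FOLLOWER   // non-leader replicas force a metadata refresh
    }

    static Action route(MigrationPhase phase, boolean isLeader) {
        if (!isLeader) {
            // Non-leader replicas shouldn't receive produce at all; in every phase they answer
            // NOT_LEADER_OR_FOLLOWER so the client refreshes metadata, which also resolves the
            // "replica doesn't know B0 yet" row.
            return Action.RETURN_NOT_LEADER_OR_FOLLOWER;
        }
        // Leader column; "after B0 set, leader unaware" can't happen because the leader set it.
        return switch (phase) {
            case BEFORE_MIGRATION -> Action.APPEND_TO_UNIFIED_LOG;
            case INITIALIZING_B0_NOT_SET -> Action.BLOCK_UNTIL_B0_PERSISTED;
            case AFTER_B0_SET -> Action.ROUTE_TO_DISKLESS;
        };
    }
}
```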
> - When the offset is not in the replica's local log (e.g., `localLogStartOffset` advanced to `B0` after migration), the remote segment selection is performed using **leader-epoch aware disambiguation**:
>   - Determine the expected leader epoch for the fetch offset from RLMM-provided per-segment leader epoch ranges.
>   - Select the unique remote segment matching `(offset range contains fetchOffset) AND (segment leader epoch == expected epoch)`.
>   - If no unique match exists, return `NOT_LEADER_OR_FOLLOWER` (correctness-first: never guess bytes).
Can this happen? What does normal tiered storage do in this case?
It is, again, just a safeguard in case of any inconsistencies; effectively this should never happen.
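A minimal sketch of that epoch-aware disambiguation, assuming a simplified view of the RLMM segment metadata; `RemoteSegment` is made up for the example, not the actual RLMM metadata type.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: pick the unique remote segment whose offset range contains fetchOffset
// AND whose leader epoch matches the epoch expected for that offset. If zero or more than one
// segment matches, the caller answers NOT_LEADER_OR_FOLLOWER rather than guessing which bytes to serve.
record RemoteSegment(long baseOffset, long endOffset, int leaderEpoch) {
    boolean contains(long offset) { return offset >= baseOffset && offset <= endOffset; }
}

final class RemoteSegmentSelector {
    static Optional<RemoteSegment> selectUnique(List<RemoteSegment> segments,
                                                long fetchOffset,
                                                int expectedLeaderEpoch) {
        List<RemoteSegment> matches = segments.stream()
            .filter(s -> s.contains(fetchOffset) && s.leaderEpoch() == expectedLeaderEpoch)
            .toList();
        return matches.size() == 1 ? Optional.of(matches.get(0)) : Optional.empty();
    }
}
```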
> - `B0` is chosen as the partition **log end offset** at the moment migration is triggered, by **force-rolling** the active segment and taking `B0 = rolledSegment.baseOffset`.
> - **Transaction handling**: Migration must wait for all ongoing transactions to complete before selecting `B0`. If there are uncommitted transactions:
>   - The migration process waits for all transactions to commit or abort (via transaction coordinator timeout)
Shouldn't we also prevent new transactions from starting? Otherwise, this wait may be infinite.
I assumed so but will add it to the doc to make it explicit.
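To make the "no new transactions" point concrete, a rough sketch of the barrier being discussed; every type and hook here (`TxnView`, `blockNewTransactions`) is hypothetical.

```java
import java.time.Duration;

// Hypothetical sketch of the two options in the doc: drain ongoing transactions (while refusing
// new transactional begins on the partition), or fall back to lastStableOffset as B0 so no
// transaction can span the tiered/diskless boundary.
interface TxnView {
    boolean hasOngoingTransactions();
    long lastStableOffset();
    long logEndOffset();
    void blockNewTransactions();   // assumed hook; without it the drain below could wait forever
}

final class BoundarySelector {
    static long selectB0(TxnView txn, Duration drainTimeout) throws InterruptedException {
        txn.blockNewTransactions();
        long deadline = System.nanoTime() + drainTimeout.toNanos();
        while (txn.hasOngoingTransactions() && System.nanoTime() < deadline) {
            Thread.sleep(50); // in a broker this would be event-driven, not polled
        }
        // If the drain finished, the LEO is safe; otherwise fall back to the last stable offset,
        // which by definition is not inside any open transaction.
        return txn.hasOngoingTransactions() ? txn.lastStableOffset() : txn.logEndOffset();
    }
}
```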
>   - The migration process waits for all transactions to commit or abort (via transaction coordinator timeout)
>   - OR uses `lastStableOffset` as `B0` to ensure no transaction spans the boundary
>   - This prevents splitting a transaction across tiered and diskless regions, which would break transactional atomicity
> - The leader **flushes the prefix** `[< B0]` before persisting `B0` to the control plane (durability).
Shouldn't we wait until B0 is replicated to the ISR? That way we eliminate (🤔?) the possibility of a log gap.
Good catch. I was going to clarify that at this point we expect the replication to have already happened, i.e. HW==LEO. So, at this point, I assume that replication has already happened, and this is an additional safeguard, because replication does not require the data to be flushed to disk; it can still be just in the page cache, AFAIU. Or am I missing something?
> I was going to clarify that at this point we expect the replication to have already happened, i.e. HW==LEO

Yeah, that's what I mean.

> So, at this point, I assume that replication has already happened, and this is an additional safeguard, because replication does not require the data to be flushed to disk; it can still be just in the page cache, AFAIU.

Kafka folks would disagree and say that replication is enough. But I think one flush won't hurt :)
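A small sketch of the ordering agreed on here: replication check first, then the extra flush, then the control-plane persist. The log and control-plane handles are invented for the example.

```java
// Hypothetical sketch: before B0 is persisted to the control plane, verify the prefix is fully
// replicated (HW == LEO) and force it to disk. Replication alone is what Kafka normally relies
// on; the explicit fsync is the extra safeguard discussed above.
interface LocalLog {
    long highWatermark();
    long logEndOffset();
    void flushUpTo(long offsetExclusive); // hard flush of segment bytes and indexes
}

interface ControlPlane {
    void persistBoundary(long b0, int leaderEpoch);
}

final class BoundaryPersister {
    static void persist(LocalLog log, ControlPlane cp, long b0, int leaderEpoch) {
        if (log.highWatermark() < log.logEndOffset()) {
            throw new IllegalStateException("prefix not fully replicated yet; retry the migration step");
        }
        log.flushUpTo(b0);                   // durability safeguard beyond replication
        cp.persistBoundary(b0, leaderEpoch); // only now does the boundary become visible
    }
}
```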
> @@ -0,0 +1,607 @@
> # Tiered → Diskless Migration (Design)
One thing is underspecified: how retention will work in hybrid topics.
> - This flush is a **hard fsync** for segment bytes and indexes: Kafka segment flush ultimately calls `FileChannel.force(true)` (via `FileRecords.flush()`), and also flushes the log directory for crash-consistent segment creation.
> - Records produced after trigger are assigned offsets starting at `B0` and are written to diskless storage.
> - The broker **atomically seeds producer state and persists `B0`** so the suffix never becomes visible without idempotent continuity metadata.
> - Control-plane persistence is **leader-epoch fenced**: a stale leader epoch cannot overwrite `B0`. A **higher** epoch may overwrite **only if `B0` is unset/reset** (to avoid orphaning diskless suffix data).
It's not clear to me in which situation B0 is unset and can be overwritten. Looking at the flow diagram, before the migration there is no B0 set, and after calling `initializeDisklessMigration` B0 is set. So it can only be set once.
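To spell out when an overwrite is even possible, a sketch of the persist guard as described in the quoted bullet, with made-up types: under this reading, B0 is UNDEFINED only before the first commit or after the exceptional reset mentioned earlier (B0 < logStartOffset after truncation), which forces a re-migration.

```java
import java.util.OptionalLong;

// Hypothetical sketch of the control-plane guard described above:
//  - a stale leader epoch can never write B0;
//  - re-sending the same B0 is an idempotent no-op;
//  - a different B0 may be written only while B0 is currently UNDEFINED,
//    i.e. before the first commit or after the exceptional truncation reset.
final class BoundaryGuard {
    private OptionalLong b0 = OptionalLong.empty(); // UNDEFINED
    private int committedEpoch = -1;

    synchronized boolean trySet(long candidateB0, int leaderEpoch) {
        if (leaderEpoch < committedEpoch) return false;      // stale leader: fenced
        if (b0.isPresent()) {
            return b0.getAsLong() == candidateB0;            // idempotent re-set only
        }
        b0 = OptionalLong.of(candidateB0);                   // unset: this (re-)commit wins
        committedEpoch = leaderEpoch;
        return true;
    }

    synchronized void resetAfterTruncation() { b0 = OptionalLong.empty(); } // exceptional path only
}
```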