
**@AnatolyPopov** (Contributor) opened this pull request.

**@giuseppelillo** (Contributor) left a comment:

One aspect missing is the possibility of split brain / zombie leaders during the migration.
A zombie leader might choose a different boundary than the one chosen by the real leader. This can be solved through leader epochs though (when committing the boundary to the Control Plane, the request with the higher leader epoch wins).


---

## Tiered storage reads on non-leader replicas
**Contributor:**

What's the difference between this design and the one already accepted upstream?

**Contributor Author:**

I re-read the KIP carefully, and so far my understanding is that it describes only follower fetch during bootstrap, while here I'm trying to describe clients fetching tiered offsets from a follower.

**Contributor:**

Right, it's a completely different thing: the KIP is only about reducing the amount of data a follower needs to fetch from the leader during bootstrap. Follower fetching is already available for Tiered Storage.

- Diskless writes allocate offsets from control-plane state; if the control plane has not published a boundary (`diskless_log_start_offset=-1`), diskless produce fails fast with `NOT_LEADER_OR_FOLLOWER` (retry).
- If the boundary cannot be fetched due to transient CP unavailability, diskless produce returns `REQUEST_TIMED_OUT`.
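
As an illustration of the error mapping in the bullets above, here is a minimal Java sketch. `BoundaryClient`, `fetchDisklessLogStartOffset`, and the `ProduceError` enum are hypothetical stand-ins, not actual Kafka or Inkless APIs.

```java
// Minimal sketch of the error mapping above. BoundaryClient and ProduceError
// are hypothetical stand-ins, not actual Kafka/Inkless classes.
import java.util.concurrent.TimeoutException;

final class DisklessProducePath {

    static final long UNDEFINED_BOUNDARY = -1L; // control plane has not published a boundary yet

    interface BoundaryClient {
        /** Returns diskless_log_start_offset for the partition, or -1 if not yet published. */
        long fetchDisklessLogStartOffset(String topic, int partition) throws TimeoutException;
    }

    enum ProduceError { NONE, NOT_LEADER_OR_FOLLOWER, REQUEST_TIMED_OUT }

    ProduceError checkBoundaryBeforeAppend(BoundaryClient controlPlane, String topic, int partition) {
        final long boundary;
        try {
            boundary = controlPlane.fetchDisklessLogStartOffset(topic, partition);
        } catch (TimeoutException transientCpUnavailability) {
            // Transient control-plane unavailability: ask the client to retry.
            return ProduceError.REQUEST_TIMED_OUT;
        }
        if (boundary == UNDEFINED_BOUNDARY) {
            // Boundary not published: fail fast and let the producer retry.
            return ProduceError.NOT_LEADER_OR_FOLLOWER;
        }
        return ProduceError.NONE; // safe to route the write to diskless storage
    }
}
```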

### Diagram: Admin-trigger + boundary persistence
**Member:**

I guess this all happens synchronously with respect to the Controller and the calling client?

**Contributor Author:**

Mostly async.
The only synchronous part from the producer's perspective is that produce requests time out while we are waiting for the HW to catch up with the LEO and for the selected B0 to be persisted.


- `B0` and associated metadata (producer state, epochs) are persisted in the control plane and are **immutable once established**.
- Setting the same `B0` again is idempotent; attempting to change to a different value is rejected.
- `B0` is only reset to `UNDEFINED` in exceptional cases where the boundary becomes invalid (e.g., `B0 < logStartOffset` after log truncation), requiring re-migration.
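
A toy, in-memory sketch of the write-once rule in the bullets above, assuming a hypothetical `BoundaryStore` rather than the real control-plane schema:

```java
// Toy in-memory stand-in for the write-once B0 rule; the real state lives in the control plane.
import java.util.HashMap;
import java.util.Map;

final class BoundaryStore {

    static final long UNDEFINED = -1L;
    private final Map<String, Long> b0ByPartition = new HashMap<>();

    /** Returns true if B0 is (now) set to the requested value. */
    synchronized boolean setB0(String topicPartition, long requestedB0) {
        long current = b0ByPartition.getOrDefault(topicPartition, UNDEFINED);
        if (current == UNDEFINED) {
            b0ByPartition.put(topicPartition, requestedB0); // first write establishes the boundary
            return true;
        }
        // Setting the same value again is idempotent; a different value is rejected.
        return current == requestedB0;
    }

    /** Exceptional path only, e.g. B0 < logStartOffset after truncation; forces re-migration. */
    synchronized void resetToUndefined(String topicPartition) {
        b0ByPartition.put(topicPartition, UNDEFINED);
    }
}
```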
**Member:**

What happens to the `diskless.enable` topic config in this case?

**Contributor Author:**

It stays enabled: B0 is set per partition, and we cannot set `diskless.enable` to false because other partitions could already have diskless data.

**Member:**

What's the way out of this situation? What should the user do?

| Phase | Leader | Non-leader replica |
|---|---|---|
| Before migration | Append to UnifiedLog | Should not receive produce |
| During initialization (B0 not set) | **Block** until B0 persisted | `NOT_LEADER_OR_FOLLOWER` |
| After B0 set (local) | Route to diskless storage | `NOT_LEADER_OR_FOLLOWER` |
| After B0 set (replica doesn't know yet) | N/A | `NOT_LEADER_OR_FOLLOWER` (force refresh) |
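
A minimal sketch of the routing decision implied by the table above; `ProduceRouter` and its boolean inputs are hypothetical names, and a real broker would derive these flags from partition state and control-plane metadata.

```java
// Hypothetical decision helper mirroring the routing table above; the real broker
// would derive these flags from partition state and control-plane metadata.
final class ProduceRouter {

    enum Route { APPEND_UNIFIED_LOG, BLOCK_UNTIL_B0_PERSISTED, ROUTE_TO_DISKLESS, REJECT_NOT_LEADER_OR_FOLLOWER }

    static Route routeProduce(boolean isLeader, boolean migrationTriggered, boolean b0Known) {
        if (!isLeader) {
            // Non-leaders never take produce traffic; a stale replica answering here
            // forces the client to refresh metadata.
            return Route.REJECT_NOT_LEADER_OR_FOLLOWER;
        }
        if (!migrationTriggered) return Route.APPEND_UNIFIED_LOG;       // before migration
        if (!b0Known)            return Route.BLOCK_UNTIL_B0_PERSISTED; // initialization in flight
        return Route.ROUTE_TO_DISKLESS;                                 // after B0 is set
    }
}
```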
**Member:**

I don't understand the N/A here; shouldn't it also be "Route to diskless storage", as above?

**Contributor Author:**

That's perhaps just poor framing: the row is for the case when B0 is set but the replica does not know about it yet, and the column is for the leader, which can't be unaware because it set B0 itself.

**Contributor Author:**

We can say "can't happen" instead for example.

- When the offset is not in the replica's local log (e.g., `localLogStartOffset` advanced to `B0` after migration), the remote segment selection is performed using **leader-epoch aware disambiguation**:
- Determine the expected leader epoch for the fetch offset from RLMM-provided per-segment leader epoch ranges.
- Select the unique remote segment matching `(offset range contains fetchOffset) AND (segment leader epoch == expected epoch)`.
- If no unique match exists, return `NOT_LEADER_OR_FOLLOWER` (correctness-first: never guess bytes).
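
A sketch of the disambiguation rule in the bullets above, using a hypothetical `RemoteSegment` record as a simplification of the per-segment metadata the RLMM provides:

```java
// Sketch of the leader-epoch aware selection rule; RemoteSegment is a hypothetical
// simplification of the metadata the RLMM provides per remote segment.
import java.util.List;
import java.util.Optional;

final class RemoteSegmentSelector {

    record RemoteSegment(long baseOffset, long endOffset, int leaderEpoch) {}

    static Optional<RemoteSegment> select(List<RemoteSegment> candidates,
                                          long fetchOffset,
                                          int expectedEpochForOffset) {
        List<RemoteSegment> matches = candidates.stream()
            .filter(s -> s.baseOffset() <= fetchOffset && fetchOffset <= s.endOffset())
            .filter(s -> s.leaderEpoch() == expectedEpochForOffset)
            .toList();
        // Correctness-first: with zero or multiple matches we refuse to guess,
        // and the caller answers NOT_LEADER_OR_FOLLOWER instead.
        return matches.size() == 1 ? Optional.of(matches.get(0)) : Optional.empty();
    }
}
```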
**Member:**

Can this happen? What does normal tiered storage do in this case?

**Contributor Author:**

Again, it is just a safeguard in case of any inconsistencies; effectively this should never happen.


- `B0` is chosen as the partition **log end offset** at the moment migration is triggered, by **force-rolling** the active segment and taking `B0 = rolledSegment.baseOffset`.
- **Transaction handling**: Migration must wait for all ongoing transactions to complete before selecting `B0`. If there are uncommitted transactions:
- The migration process waits for all transactions to commit or abort (via transaction coordinator timeout)
**Member:**

Shouldn't we also prevent new transactions from starting? Otherwise, this wait may be infinite.

**Contributor Author:**

I assumed so but will add it to the doc to make it explicit.

- The migration process waits for all transactions to commit or abort (via transaction coordinator timeout)
- OR uses `lastStableOffset` as `B0` to ensure no transaction spans the boundary
- This prevents splitting a transaction across tiered and diskless regions, which would break transactional atomicity
- The leader **flushes the prefix** `[< B0]` before persisting `B0` to the control plane (durability).
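
Putting the steps above together, a rough sketch in Java; `Log`, `TransactionGate`, and `ControlPlane` are hypothetical interfaces, not the actual broker or Inkless APIs.

```java
// Rough outline of the steps above; Log, TransactionGate and ControlPlane are
// hypothetical interfaces, not the actual broker or Inkless APIs.
final class BoundarySelection {

    interface Log {
        long forceRollActiveSegment();        // returns the rolled segment's base offset
        long lastStableOffset();              // alternative B0 so no transaction spans the boundary
        void flushUpTo(long exclusiveOffset); // hard fsync of segment bytes, indexes, log dir
    }

    interface TransactionGate {
        void blockNewTransactions();     // otherwise the wait below could be unbounded
        void awaitOngoingTransactions(); // commit or abort via the transaction coordinator timeout
    }

    interface ControlPlane {
        boolean persistB0(long b0, int leaderEpoch); // epoch-fenced, idempotent write
    }

    static long selectAndPersistB0(Log log, TransactionGate txns, ControlPlane cp, int leaderEpoch) {
        // Per the discussion above, replication is expected to have completed (HW == LEO)
        // before this point, so every in-sync replica already has the prefix.
        txns.blockNewTransactions();
        txns.awaitOngoingTransactions();
        long b0 = log.forceRollActiveSegment(); // B0 = rolledSegment.baseOffset
        log.flushUpTo(b0);                      // durably flush the local/tiered prefix [< B0]
        if (!cp.persistB0(b0, leaderEpoch)) {
            throw new IllegalStateException("B0 already established with a different value");
        }
        return b0;
    }
}
```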
**Member:**

Shouldn't we wait until B0 is replicated to the ISR? By this, we eliminate (🤔?) the possibility of a log gap.

**Contributor Author:**

Good catch. I was going to clarify that at this point we expect replication to have already happened, i.e. HW == LEO. So I assume replication is done, and the flush is an additional safeguard, because replication does not require the data to be flushed to disk; it can still be just in the page cache AFAIU. Or am I missing anything?

**Member:**

> I was going to clarify that at this point we expect replication to have already happened, i.e. HW == LEO.

Yeah, that's what I mean.

> So I assume replication is done, and the flush is an additional safeguard, because replication does not require the data to be flushed to disk; it can still be just in the page cache AFAIU.

Kafka folks would disagree and say that replication is enough. But I think one flush won't hurt :)

# Tiered → Diskless Migration (Design)
**@ivanyu** (Member), Jan 20, 2026:

One thing is underspecified: how retention will work in hybrid topics.

- This flush is a **hard fsync** for segment bytes and indexes: Kafka segment flush ultimately calls `FileChannel.force(true)` (via `FileRecords.flush()`), and also flushes the log directory for crash-consistent segment creation.
- Records produced after trigger are assigned offsets starting at `B0` and are written to diskless storage.
- The broker **atomically seeds producer state and persists `B0`** so the suffix never becomes visible without idempotent continuity metadata.
- Control-plane persistence is **leader-epoch fenced**: a stale leader epoch cannot overwrite `B0`. A **higher** epoch may overwrite **only if `B0` is unset/reset** (to avoid orphaning diskless suffix data).
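
A toy illustration of the fencing rule in the last bullet above, with hypothetical names; the real check would live in the control plane's boundary-commit path.

```java
// Toy illustration of the fencing rule above (hypothetical names); the real check
// would live in the control plane's boundary-commit path.
final class EpochFencedBoundary {

    static final long UNSET = -1L;
    private long b0 = UNSET;
    private int committedEpoch = -1;

    /** Returns true if the request is accepted (or is an idempotent repeat). */
    synchronized boolean persist(long requestedB0, int requestEpoch) {
        if (requestEpoch < committedEpoch) {
            return false;                     // stale leader epoch: fenced out
        }
        if (b0 == UNSET) {
            b0 = requestedB0;                 // first write, or a write after an explicit reset
            committedEpoch = requestEpoch;
            return true;
        }
        // Once set, even a higher epoch may not change B0, to avoid orphaning
        // diskless data already written after the boundary.
        return requestedB0 == b0;
    }
}
```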
**Contributor:**

It's not clear to me in which situation B0 is unset and can be overwritten. Looking at the flow diagram, before the migration there is no B0 set, and after calling `initializeDisklessMigration` B0 is set. So it can only be set once.
