
**@AnatolyPopov** (Contributor) opened this pull request.

**@giuseppelillo** (Contributor) left a comment:

One aspect missing is the possibility of split brain / zombie leaders during the migration.
A zombie leader might choose a different boundary than the one chosen by the real leader. This can be solved through leader epochs though (when committing the boundary to the Control Plane, the request with the higher leader epoch wins).


---

## Tiered storage reads on non-leader replicas
**Contributor:**

What's the difference between this design and the one already accepted upstream?

**Contributor Author:**

I re-read the KIP carefully, and so far my understanding is that it describes only follower fetch during bootstrap, while here I'm trying to describe clients fetching tiered offsets from a follower.

**Contributor:**

Right, it's a completely different thing: the KIP is only about reducing the amount of data a follower needs to fetch from the leader during bootstrap. Follower fetching is already available for Tiered Storage.

- Diskless writes allocate offsets from control-plane state; if the control plane has not published a boundary (`diskless_log_start_offset=-1`), diskless produce fails fast with `NOT_LEADER_OR_FOLLOWER` (retry).
- If the boundary cannot be fetched due to transient CP unavailability, diskless produce returns `REQUEST_TIMED_OUT`.
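
As an illustration of the error mapping in the bullets above, here is a minimal Java sketch. `BoundaryClient`, `fetchDisklessLogStartOffset`, and the `ProduceError` enum are hypothetical stand-ins, not actual Kafka or Inkless APIs.

```java
// Minimal sketch of the error mapping above. BoundaryClient and ProduceError
// are hypothetical stand-ins, not actual Kafka/Inkless classes.
import java.util.concurrent.TimeoutException;

final class DisklessProducePath {

    static final long UNDEFINED_BOUNDARY = -1L; // control plane has not published a boundary yet

    interface BoundaryClient {
        /** Returns diskless_log_start_offset for the partition, or -1 if not yet published. */
        long fetchDisklessLogStartOffset(String topic, int partition) throws TimeoutException;
    }

    enum ProduceError { NONE, NOT_LEADER_OR_FOLLOWER, REQUEST_TIMED_OUT }

    ProduceError checkBoundaryBeforeAppend(BoundaryClient controlPlane, String topic, int partition) {
        final long boundary;
        try {
            boundary = controlPlane.fetchDisklessLogStartOffset(topic, partition);
        } catch (TimeoutException transientCpUnavailability) {
            // Transient control-plane unavailability: ask the client to retry.
            return ProduceError.REQUEST_TIMED_OUT;
        }
        if (boundary == UNDEFINED_BOUNDARY) {
            // Boundary not published: fail fast and let the producer retry.
            return ProduceError.NOT_LEADER_OR_FOLLOWER;
        }
        return ProduceError.NONE; // safe to route the write to diskless storage
    }
}
```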

### Diagram: Admin-trigger + boundary persistence
**Member:**

I guess this all happens synchronously with respect to the Controller and the calling client?

**Contributor Author:**

Mostly async.
The only synchronous part from the producer's perspective is that produce requests time out while we are waiting for the HW to catch up with the LEO and for the selected B0 to be persisted.


- `B0` and associated metadata (producer state, epochs) are persisted in the control plane and are **immutable once established**.
- Setting the same `B0` again is idempotent; attempting to change to a different value is rejected.
- `B0` is only reset to `UNDEFINED` in exceptional cases where the boundary becomes invalid (e.g., `B0 < logStartOffset` after log truncation), requiring re-migration.
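
A toy, in-memory sketch of the write-once rule in the bullets above, assuming a hypothetical `BoundaryStore` rather than the real control-plane schema:

```java
// Toy in-memory stand-in for the write-once B0 rule; the real state lives in the control plane.
import java.util.HashMap;
import java.util.Map;

final class BoundaryStore {

    static final long UNDEFINED = -1L;
    private final Map<String, Long> b0ByPartition = new HashMap<>();

    /** Returns true if B0 is (now) set to the requested value. */
    synchronized boolean setB0(String topicPartition, long requestedB0) {
        long current = b0ByPartition.getOrDefault(topicPartition, UNDEFINED);
        if (current == UNDEFINED) {
            b0ByPartition.put(topicPartition, requestedB0); // first write establishes the boundary
            return true;
        }
        // Setting the same value again is idempotent; a different value is rejected.
        return current == requestedB0;
    }

    /** Exceptional path only, e.g. B0 < logStartOffset after truncation; forces re-migration. */
    synchronized void resetToUndefined(String topicPartition) {
        b0ByPartition.put(topicPartition, UNDEFINED);
    }
}
```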
**Member:**

What happens to the `diskless.enable` topic config in this case?

**Contributor Author:**

It stays enabled: B0 is set per partition, and we cannot set `diskless.enable` to false because other partitions could already have diskless data.

**Member:**

What's the way out of this situation? What should the user do?

| Phase | Leader | Non-leader replica |
|---|---|---|
| Before migration | Append to UnifiedLog | Should not receive produce |
| During initialization (B0 not set) | **Block** until B0 persisted | `NOT_LEADER_OR_FOLLOWER` |
| After B0 set (local) | Route to diskless storage | `NOT_LEADER_OR_FOLLOWER` |
| After B0 set (replica doesn't know yet) | N/A | `NOT_LEADER_OR_FOLLOWER` (force refresh) |
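
A minimal sketch of the routing decision implied by the table above; `ProduceRouter` and its boolean inputs are hypothetical names, and a real broker would derive these flags from partition state and control-plane metadata.

```java
// Hypothetical decision helper mirroring the routing table above; the real broker
// would derive these flags from partition state and control-plane metadata.
final class ProduceRouter {

    enum Route { APPEND_UNIFIED_LOG, BLOCK_UNTIL_B0_PERSISTED, ROUTE_TO_DISKLESS, REJECT_NOT_LEADER_OR_FOLLOWER }

    static Route routeProduce(boolean isLeader, boolean migrationTriggered, boolean b0Known) {
        if (!isLeader) {
            // Non-leaders never take produce traffic; a stale replica answering here
            // forces the client to refresh metadata.
            return Route.REJECT_NOT_LEADER_OR_FOLLOWER;
        }
        if (!migrationTriggered) return Route.APPEND_UNIFIED_LOG;       // before migration
        if (!b0Known)            return Route.BLOCK_UNTIL_B0_PERSISTED; // initialization in flight
        return Route.ROUTE_TO_DISKLESS;                                 // after B0 is set
    }
}
```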
**Member:**

I don't understand the N/A here; shouldn't it also be "Route to diskless storage", as above?

**Contributor Author:**

That's perhaps just poor framing: the row is for the case when B0 is set but the replica does not know about it yet, and the column is for the leader, which can't be unaware because it set B0 itself.

**Contributor Author:**

We can say "can't happen" instead for example.

- When the offset is not in the replica's local log (e.g., `localLogStartOffset` advanced to `B0` after migration), the remote segment selection is performed using **leader-epoch aware disambiguation**:
- Determine the expected leader epoch for the fetch offset from RLMM-provided per-segment leader epoch ranges.
- Select the unique remote segment matching `(offset range contains fetchOffset) AND (segment leader epoch == expected epoch)`.
- If no unique match exists, return `NOT_LEADER_OR_FOLLOWER` (correctness-first: never guess bytes).
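
A sketch of the disambiguation rule in the bullets above, using a hypothetical `RemoteSegment` record as a simplification of the per-segment metadata the RLMM provides:

```java
// Sketch of the leader-epoch aware selection rule; RemoteSegment is a hypothetical
// simplification of the metadata the RLMM provides per remote segment.
import java.util.List;
import java.util.Optional;

final class RemoteSegmentSelector {

    record RemoteSegment(long baseOffset, long endOffset, int leaderEpoch) {}

    static Optional<RemoteSegment> select(List<RemoteSegment> candidates,
                                          long fetchOffset,
                                          int expectedEpochForOffset) {
        List<RemoteSegment> matches = candidates.stream()
            .filter(s -> s.baseOffset() <= fetchOffset && fetchOffset <= s.endOffset())
            .filter(s -> s.leaderEpoch() == expectedEpochForOffset)
            .toList();
        // Correctness-first: with zero or multiple matches we refuse to guess,
        // and the caller answers NOT_LEADER_OR_FOLLOWER instead.
        return matches.size() == 1 ? Optional.of(matches.get(0)) : Optional.empty();
    }
}
```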
**Member:**

Can this happen? What does normal tiered storage do in this case?

**Contributor Author:**

Again, it is just a safeguard in case of any inconsistencies; effectively this should never happen.


- `B0` is chosen as the partition **log end offset** at the moment migration is triggered, by **force-rolling** the active segment and taking `B0 = rolledSegment.baseOffset`.
- **Transaction handling**: Migration must wait for all ongoing transactions to complete before selecting `B0`. If there are uncommitted transactions:
- The migration process waits for all transactions to commit or abort (via transaction coordinator timeout)
**Member:**

Shouldn't we also prevent new transactions from starting? Otherwise, this wait may be infinite.

**Contributor Author:**

I assumed so but will add it to the doc to make it explicit.

- The migration process waits for all transactions to commit or abort (via transaction coordinator timeout)
- OR uses `lastStableOffset` as `B0` to ensure no transaction spans the boundary
- This prevents splitting a transaction across tiered and diskless regions, which would break transactional atomicity
- The leader **flushes the prefix** `[< B0]` before persisting `B0` to the control plane (durability).
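
Putting the steps above together, a rough sketch in Java; `Log`, `TransactionGate`, and `ControlPlane` are hypothetical interfaces, not the actual broker or Inkless APIs.

```java
// Rough outline of the steps above; Log, TransactionGate and ControlPlane are
// hypothetical interfaces, not the actual broker or Inkless APIs.
final class BoundarySelection {

    interface Log {
        long forceRollActiveSegment();        // returns the rolled segment's base offset
        long lastStableOffset();              // alternative B0 so no transaction spans the boundary
        void flushUpTo(long exclusiveOffset); // hard fsync of segment bytes, indexes, log dir
    }

    interface TransactionGate {
        void blockNewTransactions();     // otherwise the wait below could be unbounded
        void awaitOngoingTransactions(); // commit or abort via the transaction coordinator timeout
    }

    interface ControlPlane {
        boolean persistB0(long b0, int leaderEpoch); // epoch-fenced, idempotent write
    }

    static long selectAndPersistB0(Log log, TransactionGate txns, ControlPlane cp, int leaderEpoch) {
        // Per the discussion above, replication is expected to have completed (HW == LEO)
        // before this point, so every in-sync replica already has the prefix.
        txns.blockNewTransactions();
        txns.awaitOngoingTransactions();
        long b0 = log.forceRollActiveSegment(); // B0 = rolledSegment.baseOffset
        log.flushUpTo(b0);                      // durably flush the local/tiered prefix [< B0]
        if (!cp.persistB0(b0, leaderEpoch)) {
            throw new IllegalStateException("B0 already established with a different value");
        }
        return b0;
    }
}
```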
**Member:**

Shouldn't we wait until B0 is replicated to the ISR? By this, we eliminate (🤔?) the possibility of a log gap.

**Contributor Author:**

Good catch. I was going to clarify that at this point we expect replication to have already happened, i.e. HW == LEO. So I assume replication is done, and the flush is an additional safeguard, because replication does not require the data to be flushed to disk; it can still be just in the page cache AFAIU. Or am I missing anything?

**Member:**

> I was going to clarify that at this point we expect replication to have already happened, i.e. HW == LEO.

Yeah, that's what I mean.

> So I assume replication is done, and the flush is an additional safeguard, because replication does not require the data to be flushed to disk; it can still be just in the page cache AFAIU.

Kafka folks would disagree and say that replication is enough. But I think one flush won't hurt :)

# Tiered → Diskless Migration (Design)
**@ivanyu** (Member), Jan 20, 2026:

One thing is underspecified: how retention will work in hybrid topics.

- This flush is a **hard fsync** for segment bytes and indexes: Kafka segment flush ultimately calls `FileChannel.force(true)` (via `FileRecords.flush()`), and also flushes the log directory for crash-consistent segment creation.
- Records produced after trigger are assigned offsets starting at `B0` and are written to diskless storage.
- The broker **atomically seeds producer state and persists `B0`** so the suffix never becomes visible without idempotent continuity metadata.
- Control-plane persistence is **leader-epoch fenced**: a stale leader epoch cannot overwrite `B0`. A **higher** epoch may overwrite **only if `B0` is unset/reset** (to avoid orphaning diskless suffix data).
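
A toy illustration of the fencing rule in the last bullet above, with hypothetical names; the real check would live in the control plane's boundary-commit path.

```java
// Toy illustration of the fencing rule above (hypothetical names); the real check
// would live in the control plane's boundary-commit path.
final class EpochFencedBoundary {

    static final long UNSET = -1L;
    private long b0 = UNSET;
    private int committedEpoch = -1;

    /** Returns true if the request is accepted (or is an idempotent repeat). */
    synchronized boolean persist(long requestedB0, int requestEpoch) {
        if (requestEpoch < committedEpoch) {
            return false;                     // stale leader epoch: fenced out
        }
        if (b0 == UNSET) {
            b0 = requestedB0;                 // first write, or a write after an explicit reset
            committedEpoch = requestEpoch;
            return true;
        }
        // Once set, even a higher epoch may not change B0, to avoid orphaning
        // diskless data already written after the boundary.
        return requestedB0 == b0;
    }
}
```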
**Contributor:**

It's not clear to me in which situation B0 is unset and can be overwritten. Looking at the flow diagram, before the migration there is no B0 set, and after calling `initializeDisklessMigration` B0 is set. So it can only be set once.
