Restore Failure from External Checkpoint during Upgrade #289

@sethsaperstein-lyft

Overview

Jobs that enable DELETE_ON_CANCELLATION for externalized checkpoints fail during upgrades when the operator attempts to find an externalized checkpoint to restore from. The checkpoint directory still exists, but its _metadata file has been deleted, so the job fails to start because it is unable to find the _metadata file.

When looking for externalized checkpoints, we should ensure that a _metadata file exists in the checkpoint directory before starting the job from it.
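
A minimal sketch of the proposed check, in Python for illustration only (this is not the operator's code). It assumes the checkpoints sit on a locally mounted path using Flink's usual `chk-<n>/_metadata` layout; a real deployment would more likely go through an object-store or HDFS client. The function name `latest_usable_checkpoint` is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

# Flink writes each checkpoint into a "chk-<n>" directory.
_CHK_DIR = re.compile(r"^chk-(\d+)$")


def latest_usable_checkpoint(checkpoint_root: str) -> Optional[Path]:
    """Return the newest chk-<n> directory that still contains _metadata.

    Hypothetical helper: directories whose _metadata file is missing
    (for example after DELETE_ON_CANCELLATION cleanup) are skipped.
    Returns None when no usable checkpoint is found.
    """
    root = Path(checkpoint_root)
    if not root.is_dir():
        return None

    candidates = []
    for entry in root.iterdir():
        match = _CHK_DIR.match(entry.name)
        if match and entry.is_dir():
            candidates.append((int(match.group(1)), entry))

    # Walk from the highest checkpoint number down and take the first
    # directory that actually has a _metadata file.
    for _, chk_dir in sorted(candidates, key=lambda c: c[0], reverse=True):
        if (chk_dir / "_metadata").is_file():
            return chk_dir
    return None
```

If the helper returns `None`, the caller could fall back to a savepoint or surface a clear error instead of submitting the job with a path that cannot be restored.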
