[Feature Request][Spark] Vacuum progress logging for better observability #5611

@AnudeepKonaboina

Feature request

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Overview

Improve VACUUM progress logging and metrics so that users can see how far a VACUUM has progressed when they run either:

  1. FULL VACUUM (filesystem-based directory listing), which finds eligible stale files that are not referenced in the transaction log
  2. LITE VACUUM (Delta log-based scan of RemoveFile / CDF actions)

Motivation

Today, when a user triggers a VACUUM command, only the following logs are emitted:

25/12/02 04:51:29 INFO VacuumCommand: Starting garbage collection (dryRun = false) of untracked files older than 2 Dec 2025 04:51:29 GMT in hdfs://hadoop.spark:9000/tmp/delta_vacuum_progress
25/12/02 04:51:41 INFO VacuumCommand: Deleting untracked files and empty directories in hdfs://hadoop.spark:9000/tmp/delta_vacuum_progress. The amount of data to be deleted is 447370000324 (in bytes)

-- After ~1 hr 30 minutes ----

25/12/02 06:21:45 INFO VacuumCommand: Deleted 250000 files (44737324 bytes) and directories in a total of 1 directories. Vacuum stats: DeltaVacuumStats(false,Some(0),604800000,1764651089738,1,2500,2500,44737324,10938,1886,1764651089735,1764651105425,0,50,Some(0),Some(50),LITE)

The logs above show that the VACUUM completed successfully after about 90 minutes, but during that time nothing was logged to indicate what was happening or how many files had been listed so far. This makes it difficult to track the progress of the VACUUM command.
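To put a number on the silent window, the gap between the last two log lines can be computed directly from their timestamps. This is only a small sketch: `silent_gap` and `LOG_TIME_FORMAT` are names made up for illustration, and the timestamp layout is assumed to be `yy/MM/dd HH:mm:ss` as it appears in the log lines above.

```python
from datetime import datetime, timedelta

# Assumed layout of the timestamps in the log lines above: yy/MM/dd HH:mm:ss
LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"


def silent_gap(start: str, end: str) -> timedelta:
    """Return the time elapsed between two log timestamps."""
    started = datetime.strptime(start, LOG_TIME_FORMAT)
    ended = datetime.strptime(end, LOG_TIME_FORMAT)
    return ended - started


# Timestamps of the "Deleting untracked files..." and "Deleted 250000 files..." lines
gap = silent_gap("25/12/02 04:51:41", "25/12/02 06:21:45")
print(gap)  # 1:30:04
```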

Proposed improvements

  • A progress thread that monitors the VACUUM command and reports the number of files listed so far, every 10 minutes by default, so that we know what is happening in the backend.
  • This progress must be written to the logs so that we can track the time taken against the number of files listed.
  • The implementation must also differ between FULL VACUUM and LITE VACUUM.
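The idea in the list above could be sketched roughly as follows. This is not Delta Lake code: `ProgressReporter`, its parameters, and the log messages are all hypothetical, and the sketch is written in Python purely to illustrate the shape of a background progress thread. The listing (or log-scanning) code bumps a shared counter, and a daemon thread logs the running total at a fixed interval (600 seconds, i.e. 10 minutes, by default).

```python
import logging
import threading

logger = logging.getLogger("vacuum_progress")  # hypothetical logger name


class ProgressReporter:
    """Logs a running count at a fixed interval from a background thread."""

    def __init__(self, mode, interval_seconds=600.0, unit="files listed"):
        self.mode = mode                  # e.g. "FULL" or "LITE"
        self.interval_seconds = interval_seconds
        self.unit = unit                  # what is being counted in this mode
        self._count = 0
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def add(self, n=1):
        """Called by the worker whenever it has processed n more items."""
        with self._lock:
            self._count += n

    def count(self):
        with self._lock:
            return self._count

    def _run(self):
        # Event.wait returns False on timeout and True once stop is set,
        # so this loop logs once per interval until the work is finished.
        while not self._stop.wait(self.interval_seconds):
            logger.info("VACUUM %s in progress: %d %s so far",
                        self.mode, self.count(), self.unit)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._stop.set()
        self._thread.join()
        logger.info("VACUUM %s finished: %d %s in total",
                    self.mode, self.count(), self.unit)
        return False  # do not swallow exceptions raised by the worker
```

A FULL run might wrap its directory listing in `with ProgressReporter("FULL") as progress:` and call `progress.add(len(batch))` per listed batch, while a LITE run might use `ProgressReporter("LITE", unit="log actions scanned")` and count scanned actions instead. Keeping the counter separate from the reporting thread means the two modes differ only in what they count.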

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.

Metadata

Assignees

No one assigned

Labels

enhancement (New feature or request)