[Feature Request][Spark] Add table identifier for VACUUM log messages for better observability #5594

@AnudeepKonaboina

Feature request

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel

Overview

Improve the observability of Delta Lake VACUUM operations by including a table identifier (table ID / table path) in every VACUUM‑related log message. This lets users see exactly which table is being vacuumed when multiple Delta tables are processed within the same job. The table path does appear in the message logged when a VACUUM starts, but that alone is not convenient for debugging.

Motivation

Currently, when a user runs VACUUM on multiple tables, the log messages do not carry a table identifier. Only the table path appears, and not in every message: the final "Deleted ..." line omits it. When multiple tables are processed in a single job, users cannot reliably determine from the logs alone which table is being cleaned up:

Current logs:

25/11/27 05:05:14 INFO VacuumCommand: Starting garbage collection (dryRun = false) of untracked files older than 20 Nov 2025 05:05:14 GMT in /tmp/delta/vacuum
25/11/27 05:05:26 INFO VacuumCommand: Deleting untracked files and empty directories in  /tmp/delta/vacuum. The amount of data to be deleted is 0 (in bytes)
25/11/27 05:05:29 INFO VacuumCommand: Deleted 0 files (0 bytes) and directories in a total of 1 directories. Vacuum stats: DeltaVacuumStats(false,None,604800000,1763615114082,1,4,0,0,10038,1158,1764219914030,1764219929008,8,8,8,false,0,0,7,None,None,FULL)

25/11/27 05:05:30 INFO VacuumCommand: Starting garbage collection (dryRun = false) of untracked files older than 20 Nov 2025 05:05:30 GMT in  /tmp/delta/vacuum_2
25/11/27 05:05:36 INFO VacuumCommand: Deleting untracked files and empty directories in  /tmp/delta/vacuum_2. The amount of data to be deleted is 10638 (in bytes)
25/11/27 05:05:37 INFO VacuumCommand: Deleted 9 files (10638 bytes) and directories in a total of 1 directories. Vacuum stats: DeltaVacuumStats(false,None,604800000,1763615130357,1,10,9,10638,4660,1394,1764219930305,1764219937437,8,8,8,false,0,0,31,None,None,FULL)

25/11/27 05:05:38 INFO VacuumCommand: Starting garbage collection (dryRun = false) of untracked files older than 20 Nov 2025 05:05:38 GMT in  /tmp/delta/orders
25/11/27 05:05:44 INFO VacuumCommand: Deleting untracked files and empty directories in  /tmp/delta/orders The amount of data to be deleted is 0 (in bytes)
25/11/27 05:05:45 INFO VacuumCommand: Deleted 0 files (0 bytes) and directories in a total of 1 directories. Vacuum stats: DeltaVacuumStats(false,None,604800000,1763615138428,1,4,0,0,4887,938,1764219938390,1764219945316,8,8,8,false,0,0,5,None,None,FULL)

The above logs make it hard to:

  • Attribute long‑running VACUUM operations to specific tables
  • Debug problems when a particular table’s VACUUM fails or behaves unexpectedly
  • Distinguish multiple VACUUM runs across different tables

Further details

The implementation would add a [tableId=<table_id>] prefix to all VACUUM log messages. The table ID is already available in VacuumCommand.scala:

 val snapshot = table.update()
 deltaLog.protocolWrite(snapshot.protocol)
 val tableId = snapshot.metadata.id

  • All changes would be confined to VacuumCommand.scala.

Expected output after the change:

INFO VacuumCommand: [tableId=abcd1234] Starting garbage collection ...
INFO VacuumCommand: [tableId=abcd1234] Deleting untracked files and empty directories in ...
INFO VacuumCommand: [tableId=abcd1234] Deleted N files (...) Vacuum stats: ...
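
One way the prefix could be applied uniformly is sketched below. This is only an illustration of the message format proposed above: VacuumLogPrefix and TableScopedLogger are hypothetical names that do not exist in Delta Lake, and the sink function stands in for whatever logging call VacuumCommand already uses.

```scala
// Illustrative sketch only: VacuumLogPrefix and TableScopedLogger are
// hypothetical names, not part of Delta Lake.
object VacuumLogPrefix {
  // Builds the "[tableId=<id>] <message>" form proposed in this issue.
  def withTableId(tableId: String, message: String): String =
    s"[tableId=$tableId] $message"
}

// Wraps any String => Unit sink so that every message carries the table id.
// In VacuumCommand the sink would presumably be the existing logging call
// (an assumption; the actual wiring is up to the implementation).
final class TableScopedLogger(tableId: String, sink: String => Unit) {
  def info(message: String): Unit =
    sink(VacuumLogPrefix.withTableId(tableId, message))
}

object VacuumLogPrefixDemo {
  def main(args: Array[String]): Unit = {
    // "abcd1234" is the placeholder id used in the expected output above.
    val log = new TableScopedLogger("abcd1234", println)
    log.info("Starting garbage collection ...")
    log.info("Deleted N files (...) Vacuum stats: ...")
  }
}
```

Building the prefix in one place keeps the three messages consistent and makes the logs easy to filter per table with a plain text search on the [tableId=...] token.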

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.

Labels: enhancement (New feature or request)
