gc: implement detailed report for --dry run #10937
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##             main   #10937       +/-  ##
===========================================
- Coverage   90.68%   70.31%    -20.38%
===========================================
  Files         504      503         -1
  Lines       39795    41016      +1221
  Branches     3141     3237        +96
===========================================
- Hits        36087    28839      -7248
- Misses       3042    11323      +8281
- Partials      666      854       +188
```
Hi maintainers, the CI is failing as expected because this work depends on an unmerged PR in `dvc-data`. I'll convert this to a Draft PR until the dependency is merged.
Hi, I think you are complicating the feature and implementation a lot. I get the intent behind separating collection and removal, but at this stage it feels like too much work for limited gain. There's also no guarantee we'll always be able to maintain that separation; for example, if we implement #829, separating collection from removal may not be feasible.

I don't think we need tables here, or the dir/file distinction. While size information can be useful, since we're dealing with garbage objects it may add unnecessary overhead. Alternatively, we could just log them inside `gc()` as they are deleted.
Hi! Thanks for the feedback - you're right that this is overcomplicated. I want to simplify it following your suggestions. Quick question: when you said "log them inside `gc()` as they are deleted", did you mean adding logging to `gc()` in `dvc-data` itself? Because if I only modify the DVC layer, I'd have to call into `dvc-data` anyway to enumerate the garbage objects before they are removed. Just want to confirm before I start making changes. Thanks!
yes, just adding:

```python
from itertools import chain

for path in chain(dir_paths, file_paths):
    if dry:
        logger.info("Removing %s", path)
    else:
        odb.fs.remove(path)
```

Alternatively, we can also break this out of `gc()` (see line 44 in 7ed8d89).
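As a self-contained illustration of that pattern (a toy sketch: the sample paths, the `dry` flag, and `os.remove` are stand-ins for the real ODB values and `odb.fs.remove`):

```python
import logging
import os
from itertools import chain

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("dvc")

dir_paths = [".dvc/cache/files/md5/aa/bb.dir"]   # stand-in garbage .dir objects
file_paths = [".dvc/cache/files/md5/cc/dd"]      # stand-in garbage file objects
dry = True

for path in chain(dir_paths, file_paths):
    if dry:
        # dry run: only report what would be removed, touch nothing
        logger.info("Removing %s", path)
    else:
        os.remove(path)  # stand-in for odb.fs.remove(path)
```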
Hi! I've implemented the logging approach as you suggested. The paths are now logged inside `gc()` as they are deleted.

However, I noticed that the output is duplicated: the same garbage paths are logged twice. This happens because the local cache and the repo cache often point to the same ObjectDB instance, so the same directory is scanned (and logged) once per scheme.

In my previous implementation, I had a helper that deduplicated the ODBs:

```python
def _iter_unique_odbs(odbs):
    """
    The local cache and repo cache often point to the same ObjectDB instance.
    Without deduplication, we would scan the same directory twice.
    """
    seen = set()
    for scheme, odb in odbs:
        if odb and odb not in seen:
            seen.add(odb)
            yield scheme, odb
```

So I think the fix is to deduplicate ODBs before calling `ogc()`:

```python
seen_odbs = set()
for scheme, odb in self.cache.by_scheme():
    if not odb or odb in seen_odbs:
        continue
    seen_odbs.add(odb)
    num_removed = ogc(odb, used_obj_ids, jobs=jobs, dry=dry)
    # ...
```

Does this approach look good to you?
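To see why the deduplication matters, here is a toy run (it uses `_iter_unique_odbs` from above; `FakeODB` is a stand-in for dvc-data's ObjectDB, which here only needs to be hashable):

```python
class FakeODB:
    """Minimal stand-in: hashable by identity, like any plain object."""
    def __init__(self, path):
        self.path = path

local = FakeODB("/tmp/dvc-cache")
odbs = [
    ("local", local),
    ("repo", local),                       # same instance under another scheme
    ("s3", FakeODB("s3://bucket/cache")),
]

print([scheme for scheme, _ in _iter_unique_odbs(odbs)])
# -> ['local', 's3']  (the duplicate instance is skipped)
```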
Hi, we can fix that issue separately by keeping a set of the ODBs we've already visited.
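A minimal sketch of such a fix, assuming deduplication is keyed on each ODB's root path rather than on instance identity (the helper name and `path` attribute are illustrative, not dvc-data's confirmed API):

```python
def iter_unique_odbs_by_path(odbs):
    """Yield each distinct ODB once, keyed by its root path, so two
    ObjectDB instances backed by the same directory are not scanned twice."""
    seen_paths = set()
    for scheme, odb in odbs:
        if odb is None or odb.path in seen_paths:
            continue
        seen_paths.add(odb.path)
        yield scheme, odb
```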
Summary

Fixes #10585

This PR significantly improves the output of `dvc gc --dry`. Previously, a dry run only reported the count of objects to be removed, which left users guessing about what exactly would be deleted. Now, it provides a detailed table listing the objects, allowing users to verify the cleanup target before execution.
Example

Given the code of this unit test, the command outputs a detailed table of the garbage objects to be removed.
Changes

- Updated the `dvc-data` GC interface to iterate over garbage objects.
- The dry-run report lists `Type`, `OID (MD5)`, `Size`, `Modified` time, and `Path` for each object.

You will notice the `Path` column displays the internal cache path (e.g., `.dvc/cache/files/md5/...`) rather than the original workspace filename.

Why internal paths?
Retrieving the original workspace path for a garbage object is complex. Since gc works at the ODB (Object Database) level, it doesn't inherently know where the file came from. Finding the original name would require scanning `git reflog` or refactoring the upper-layer architecture, both of which involve significant complexity or performance costs.

However, this output is still highly valuable:

- The `Size` and `Modified` timestamps act as strong evidence: users can often identify "that 2GB model file from last Tuesday" just by looking at these metadata columns.
- The hash-only view matches similar tools (e.g., `git fsck` or `git prune`), where only hashes are displayed.

For this PR, I opted for this simple, robust implementation. It provides immediate value while leaving room to discuss more advanced reverse-lookup logic in future iterations.
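For readers unfamiliar with the cache layout: the internal path already encodes the OID, since DVC stores each object under its first two hex digits as a directory level. A small sketch of the reverse mapping (layout as in DVC 3.x local caches; the function name is illustrative):

```python
import os

def oid_from_cache_path(path: str) -> str:
    """Recover the MD5 OID from a local cache path:
    .dvc/cache/files/md5/ab/cdef0123...  ->  'abcdef0123...'
    """
    head, name = os.path.split(path)   # name   = 'cdef0123...'
    prefix = os.path.basename(head)    # prefix = 'ab'
    return prefix + name

print(oid_from_cache_path(".dvc/cache/files/md5/ab/cdef0123456789"))
# -> abcdef0123456789
```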
Performance & Remote Storage
The logic explicitly checks `isinstance(odb.fs, LocalFileSystem)` before fetching detailed metadata (Size, Modified time).

Reasoning:
Retrieving `stat` information for every single garbage object on remote storage (S3, Azure, etc.) would trigger a separate network request per object. For large projects with thousands of garbage files, this would make `dvc gc --dry` unacceptably slow and potentially hit API rate limits. Therefore, detailed metadata is skipped for non-local filesystems to maintain performance.
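A sketch of that guard, assuming an fsspec-style `fs.info()` that returns `size` and, for local files, `mtime` (the helper name is illustrative, and the exact import path may differ per DVC version):

```python
from dvc.fs import LocalFileSystem  # import path may vary across DVC versions

def object_details(odb, path):
    """Fetch size/mtime only when the stat is a cheap local call."""
    size = mtime = None
    if isinstance(odb.fs, LocalFileSystem):
        info = odb.fs.info(path)   # one local stat; no network round-trip
        size = info.get("size")
        mtime = info.get("mtime")
    return size, mtime
```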
Testing

Added comprehensive tests in `tests/func/test_gc.py` to cover the new `--dry` mode, including the `--cloud` flag (scanning both local and remote caches). All tests pass locally.
Dependencies
Requires an updated `dvc-data` (PR: treeverse/dvc-data#650).