gc: implement detailed report for --dry run #10937
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##             main   #10937       +/-  ##
===========================================
- Coverage   90.68%   70.31%    -20.38%
===========================================
  Files         504      503         -1
  Lines       39795    41016      +1221
  Branches     3141     3237        +96
===========================================
- Hits        36087    28839      -7248
- Misses       3042    11323      +8281
- Partials      666      854       +188
```
Hi maintainers, the CI is failing as expected because this work depends on an unmerged PR in `dvc-data`. I'll convert this to a Draft PR until the dependency is merged.
Hi, I think you are complicating the feature and implementation a lot. I get the intent behind separating collection and removal, but at this stage it feels like too much work for limited gain. There's also no guarantee we'll always be able to maintain that separation; for example, if we implement #829, separating collection from removal may not be feasible.

I don't think we need tables here, or the dir/file distinction. While size information can be useful, since we're dealing with garbage objects it may add unnecessary overhead. Alternatively, we could just log them inside `gc()` as they are deleted.
Hi! Thanks for the feedback - you're right that this is overcomplicated. I want to simplify it following your suggestions. Quick question: when you said "log them inside `gc()` as they are deleted", did you mean adding logging to `gc()` in `dvc-data` itself? Because if I only modify the DVC layer, I'd have to call into `dvc-data` anyway to enumerate the garbage objects before they are removed. Just want to confirm before I start making changes. Thanks!
yes, just adding:

```python
from itertools import chain

for path in chain(dir_paths, file_paths):
    if dry:
        logger.info("Removing %s", path)
    else:
        odb.fs.remove(path)
```

Alternatively, we can also break this out of `gc()` (see line 44 in 7ed8d89).
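As a self-contained illustration of that pattern (a toy sketch: the sample paths, the `dry` flag, and `os.remove` are stand-ins for the real ODB values and `odb.fs.remove`):

```python
import logging
import os
from itertools import chain

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("dvc")

dir_paths = [".dvc/cache/files/md5/aa/bb.dir"]   # stand-in garbage .dir objects
file_paths = [".dvc/cache/files/md5/cc/dd"]      # stand-in garbage file objects
dry = True

for path in chain(dir_paths, file_paths):
    if dry:
        # dry run: only report what would be removed, touch nothing
        logger.info("Removing %s", path)
    else:
        os.remove(path)  # stand-in for odb.fs.remove(path)
```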
Hi! I've implemented the logging approach as you suggested. The paths are now logged inside `gc()` as they are deleted.

However, I noticed that the output is duplicated: the same garbage paths are logged twice. This happens because the local cache and the repo cache often point to the same ObjectDB instance, so the same directory is scanned (and logged) once per scheme.

In my previous implementation, I had a helper that deduplicated the ODBs:

```python
def _iter_unique_odbs(odbs):
    """
    The local cache and repo cache often point to the same ObjectDB instance.
    Without deduplication, we would scan the same directory twice.
    """
    seen = set()
    for scheme, odb in odbs:
        if odb and odb not in seen:
            seen.add(odb)
            yield scheme, odb
```

So I think the fix is to deduplicate ODBs before calling `ogc()`:

```python
seen_odbs = set()
for scheme, odb in self.cache.by_scheme():
    if not odb or odb in seen_odbs:
        continue
    seen_odbs.add(odb)
    num_removed = ogc(odb, used_obj_ids, jobs=jobs, dry=dry)
    # ...
```

Does this approach look good to you?
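To see why the deduplication matters, here is a toy run (it uses `_iter_unique_odbs` from above; `FakeODB` is a stand-in for dvc-data's ObjectDB, which here only needs to be hashable):

```python
class FakeODB:
    """Minimal stand-in: hashable by identity, like any plain object."""
    def __init__(self, path):
        self.path = path

local = FakeODB("/tmp/dvc-cache")
odbs = [
    ("local", local),
    ("repo", local),                       # same instance under another scheme
    ("s3", FakeODB("s3://bucket/cache")),
]

print([scheme for scheme, _ in _iter_unique_odbs(odbs)])
# -> ['local', 's3']  (the duplicate instance is skipped)
```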
Hi, we can fix that issue separately by keeping a set of the ODBs we've already visited.
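A minimal sketch of such a fix, assuming deduplication is keyed on each ODB's root path rather than on instance identity (the helper name and `path` attribute are illustrative, not dvc-data's confirmed API):

```python
def iter_unique_odbs_by_path(odbs):
    """Yield each distinct ODB once, keyed by its root path, so two
    ObjectDB instances backed by the same directory are not scanned twice."""
    seen_paths = set()
    for scheme, odb in odbs:
        if odb is None or odb.path in seen_paths:
            continue
        seen_paths.add(odb.path)
        yield scheme, odb
```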
Summary

Fixes #10585

This PR significantly improves the output of `dvc gc --dry`. Previously, a dry run only reported the count of objects to be removed, which left users guessing about what exactly would be deleted. Now, it provides a detailed table listing the objects, allowing users to verify the cleanup target before execution.
Example

Given the code of this unit test, the command outputs a detailed table of the garbage objects to be removed.
Changes

- Updated the `dvc-data` GC interface to iterate over garbage objects.
- The dry-run report lists `Type`, `OID (MD5)`, `Size`, `Modified` time, and `Path` for each object.

You will notice the `Path` column displays the internal cache path (e.g., `.dvc/cache/files/md5/...`) rather than the original workspace filename.

Why internal paths?
Retrieving the original workspace path for a garbage object is complex. Since gc works at the ODB (Object Database) level, it doesn't inherently know where the file came from. Finding the original name would require scanning `git reflog` or refactoring the upper-layer architecture, both of which involve significant complexity or performance costs.

However, this output is still highly valuable:

- The `Size` and `Modified` timestamps act as strong evidence: users can often identify "that 2GB model file from last Tuesday" just by looking at these metadata columns.
- The hash-only view matches similar tools (e.g., `git fsck` or `git prune`), where only hashes are displayed.

For this PR, I opted for this simple, robust implementation. It provides immediate value while leaving room to discuss more advanced reverse-lookup logic in future iterations.
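For readers unfamiliar with the cache layout: the internal path already encodes the OID, since DVC stores each object under its first two hex digits as a directory level. A small sketch of the reverse mapping (layout as in DVC 3.x local caches; the function name is illustrative):

```python
import os

def oid_from_cache_path(path: str) -> str:
    """Recover the MD5 OID from a local cache path:
    .dvc/cache/files/md5/ab/cdef0123...  ->  'abcdef0123...'
    """
    head, name = os.path.split(path)   # name   = 'cdef0123...'
    prefix = os.path.basename(head)    # prefix = 'ab'
    return prefix + name

print(oid_from_cache_path(".dvc/cache/files/md5/ab/cdef0123456789"))
# -> abcdef0123456789
```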
Performance & Remote Storage
The logic explicitly checks `isinstance(odb.fs, LocalFileSystem)` before fetching detailed metadata (Size, Modified time).

Reasoning:
Retrieving `stat` information for every single garbage object on remote storage (S3, Azure, etc.) would trigger a separate network request per object. For large projects with thousands of garbage files, this would make `dvc gc --dry` unacceptably slow and potentially hit API rate limits. Therefore, detailed metadata is skipped for non-local filesystems to maintain performance.
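A sketch of that guard, assuming an fsspec-style `fs.info()` that returns `size` and, for local files, `mtime` (the helper name is illustrative, and the exact import path may differ per DVC version):

```python
from dvc.fs import LocalFileSystem  # import path may vary across DVC versions

def object_details(odb, path):
    """Fetch size/mtime only when the stat is a cheap local call."""
    size = mtime = None
    if isinstance(odb.fs, LocalFileSystem):
        info = odb.fs.info(path)   # one local stat; no network round-trip
        size = info.get("size")
        mtime = info.get("mtime")
    return size, mtime
```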
Testing

Added comprehensive tests in `tests/func/test_gc.py` to cover the new `--dry` mode, including the `--cloud` flag (scanning both local and remote caches). All tests pass locally.
Dependencies
Requires an updated `dvc-data` (PR: treeverse/dvc-data#650).