Deduplication during pull #225
Conversation
Walkthrough: Reworks missing-entry collection in fetch.rs.
Actionable comments posted: 0
🧹 Nitpick comments (3)
oxen-rust/src/lib/src/core/v_latest/fetch.rs (3)
320-332: Duplicate tree traversal: `list_all_files` is called twice per subtree.
`list_all_files` is invoked in `collect_missing_entries_for_subtree` (line 374) and again here (line 329). Consider returning the collected hashes from `collect_missing_entries_for_subtree` to avoid the redundant traversal.
🔎 Suggested approach
Modify `collect_missing_entries_for_subtree` to return or populate the new file hashes it encounters:

```rust
fn collect_missing_entries_for_subtree(
    tree: &MerkleTreeNode,
    subtree_path: &PathBuf,
    missing_entries: &mut HashSet<Entry>,
    total_bytes: &mut u64,
    shared_hashes: &HashSet<MerkleHash>,
    new_hashes: &mut HashSet<MerkleHash>, // collect newly seen hashes
) -> Result<(), OxenError>
```

Then the caller can extend `shared_file_hashes` with `new_hashes` without re-traversing the tree.
349-361: Same duplicate traversal issue applies here. For consistency with the earlier suggestion, this branch should also be refactored to avoid calling `list_all_files` twice per commit.
232-281: Remove redundant `.clone()` on `MerkleHash`.
`MerkleHash` derives `Copy`, making `.clone()` on lines 257 and 276 unnecessary. Use dereference instead:

```diff
- file_hashes.insert(file.file_node.hash().clone());
+ file_hashes.insert(*file.file_node.hash());
```

Apply to both locations.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
oxen-rust/src/lib/src/core/v_latest/fetch.rs (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
oxen-rust/src/lib/src/core/v_latest/fetch.rs (2)
oxen-rust/src/lib/src/model/merkle_tree/merkle_hash.rs (1)
`new` (18-20)
oxen-rust/src/lib/src/repositories/tree.rs (2)
`get_subtree_by_depth_with_unique_children` (367-390), `list_all_files` (702-709)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Test Suite / Test Suite (macos-latest)
- GitHub Check: Test Suite / Test Suite (windows-latest)
- GitHub Check: Test Suite / Test Suite (ubuntu-latest)
- GitHub Check: Lint / Cargo check, format, clippy + Ruff
🔇 Additional comments (3)
oxen-rust/src/lib/src/core/v_latest/fetch.rs (3)
12-12: LGTM! The `MerkleHash` import is correctly added and necessary for the new deduplication logic using `HashSet<MerkleHash>`.
292-298: LGTM! Good approach to initialize the deduplication set from the HEAD commit. This ensures files already present locally are not re-downloaded.
367-393: No action needed. The function `collect_missing_entries_for_subtree` is private to this module (not `pub`), so it has no external callers. Both internal call sites within the same file are already updated with the new `shared_hashes` parameter.
Likely an incorrect or invalid review comment.
Force-pushed from 295c3d1 to a81a507
rpschoenburg left a comment
With the latest commit, I think this is good. The prior optimization (skipping loading parts of the merkle tree that have already been loaded into memory) is restored, and this still does the de-dup.
Considering the need for it in multiple places now, I'm thinking I'll go ahead and make a new tree-loading method that collects all the file nodes while we're reading the tree into memory, which would eliminate the need for the tree traversal in collect_missing_entries. But I think this is good for now.
@rpschoenburg Exactly what I was thinking; we now have it in pull and push, so it would be nice to have that.