
Conversation

@subygan subygan (Collaborator) commented Dec 19, 2025

Summary by CodeRabbit

  • Refactor
    • Fetch/deduplication now tracks file hashes across subtree traversal, reducing duplicate downloads and improving total-byte accuracy and fetch performance.
  • Chores
    • Archive creation now uses a faster compression setting, producing archives more quickly at the cost of a lower compression ratio.


@coderabbitai coderabbitai bot (Contributor) commented Dec 19, 2025

Walkthrough

Reworks missing-entry collection in fetch.rs: adds MerkleHash to the public imports and signatures, introduces a shared file_hashes_seen: HashSet<MerkleHash> for deduplication, removes the per-subtree path parameter, and switches traversal to node-based Merkle tree walking with CommitEntry::from_node.

Changes

Cohort / File(s) | Summary

  • Fetch dedupe & traversal (oxen-rust/src/lib/src/core/v_latest/fetch.rs): Added MerkleHash to imports and public signatures; replaced the subtree_path parameter with a shared file_hashes_seen: &mut HashSet<MerkleHash>; changed subtree processing to walk Merkle nodes and build CommitEntry values via CommitEntry::from_node; dedupes by Merkle hash and updates missing_entries/total_bytes only for unseen hashes.
  • Tree compression change (oxen-rust/src/lib/src/repositories/tree.rs): Switched the gzip compression setting to Compression::fast() in compress_tree, compress_nodes, compress_node, and compress_commits, trading compression ratio for encoder speed while keeping APIs and control flow unchanged.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Verify that all call sites of collect_missing_entries_for_subtree are updated to the new signature and lifetimes.
  • Confirm deduplication correctness: file_hashes_seen is updated and prevents duplicate entries across subtrees.
  • Inspect CommitEntry::from_node usage and MerkleTreeNodeType handling (File vs FileChunk).
  • Check archive outputs for intended effects of Compression::fast() where compression ratio/throughput matters.

Suggested reviewers

  • rpschoenburg

Poem

"I hopped through nodes, a nimble lab,
Collected hashes from each slab,
One set to banish twin-file frights,
I stitch the tree through days and nights,
— your rabbit reviewer, hat and map 🐇"

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 66.67%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title 'Deduplication during pull' directly reflects the main changes: deduplication logic (file_hashes_seen tracking) in the fetch/pull path and the compression performance change.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a81a507 and 9663243.

📒 Files selected for processing (2)
  • oxen-rust/src/lib/src/core/v_latest/fetch.rs
  • oxen-rust/src/lib/src/repositories/tree.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Lint / Cargo check, format, clippy + Ruff
  • GitHub Check: Test Suite / Test Suite (windows-latest)
  • GitHub Check: Test Suite / Test Suite (macos-latest)
  • GitHub Check: Test Suite / Test Suite (ubuntu-latest)
🔇 Additional comments (6)
oxen-rust/src/lib/src/repositories/tree.rs (2)

59-59: LGTM: Documentation improvements.

The doc comment clarifications better describe the hash collection behavior.

Also applies to: 81-81


889-889: Verify the compression speed vs. size trade-off.

The switch from Compression::default() to Compression::fast() prioritizes compression speed over compression ratio. This will produce larger archives but compress faster.

Ensure this trade-off aligns with your performance goals, especially considering:

  • Network transfer time for larger archives
  • Storage requirements
  • Overall end-to-end fetch/pull performance

Also applies to: 930-930, 964-964, 990-990
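
A minimal sketch of what this change amounts to, assuming tree.rs uses the flate2 crate's gzip encoder (the function below is illustrative, not the actual compress_tree code):

use flate2::{write::GzEncoder, Compression};
use std::io::Write;

fn gzip_fast(data: &[u8]) -> std::io::Result<Vec<u8>> {
    // Compression::fast() (level 1) minimizes CPU time at the cost of
    // larger output; Compression::default() (level 6) is the reverse.
    let mut encoder = GzEncoder::new(Vec::new(), Compression::fast());
    encoder.write_all(data)?;
    encoder.finish()
}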

oxen-rust/src/lib/src/core/v_latest/fetch.rs (4)

11-12: LGTM: Import additions support new deduplication logic.

The added imports (EMerkleTreeNode, MerkleTreeNode, MerkleHash) are correctly used in the refactored collect_missing_entries_for_subtree function.


256-256: LGTM: Cross-commit deduplication via shared hash set.

Introducing file_hashes_seen enables deduplication of file entries across multiple commits, ensuring each unique file hash is downloaded only once to the versions directory.
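
A hedged sketch of the pattern (the type shapes below are stand-ins; the real Entry and MerkleHash in fetch.rs differ):

use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct MerkleHash(u128); // hypothetical stand-in for the real type

#[derive(PartialEq, Eq, Hash)]
struct Entry {
    hash: MerkleHash,
    num_bytes: u64,
}

fn record_if_unseen(
    entry: Entry,
    file_hashes_seen: &mut HashSet<MerkleHash>,
    missing_entries: &mut HashSet<Entry>,
    total_bytes: &mut u64,
) {
    // HashSet::insert returns true only for hashes not seen before, so a
    // file reachable from multiple commits/subtrees is counted exactly once.
    if file_hashes_seen.insert(entry.hash) {
        *total_bytes += entry.num_bytes;
        missing_entries.insert(entry);
    }
}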


281-286: LGTM: Function signature updated for deduplication.

The signature change removes the unused subtree_path parameter and adds file_hashes_seen to enable cross-commit file deduplication. Call sites are correctly updated.

Also applies to: 306-311, 317-321


323-338: Refactored code creates CommitEntry instances with incomplete path information.

The refactored collect_missing_entries_for_subtree uses tree.walk_tree() which does not track directory context. Consequently, CommitEntry::from_node(&node.node) at line 331 creates entries with only the filename in the path field (via PathBuf::from(file_node.name())), not the full directory path.

Unlike the recursive r_download_entries pattern elsewhere in the same file, which explicitly tracks directory context through recursion and constructs complete paths via directory.join(file_node.name()), this refactored approach cannot reconstruct full paths. While version storage is hash-based and the fetch operation functions correctly, the incomplete path field violates semantic correctness and could cause issues if the path is used in logging, debugging, or downstream operations beyond the immediate fetch context.

Consider tracking directory context during tree traversal or restructuring the callback to pass directory information.
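
A minimal sketch of that fix, carrying the directory down a recursive walk (the node shape is illustrative, not the actual MerkleTreeNode API):

use std::path::{Path, PathBuf};

enum Node {
    Dir { name: String, children: Vec<Node> },
    File { name: String },
}

// Accumulate the directory during recursion so each file entry gets its
// full repo-relative path, not just its file name.
fn collect_paths(node: &Node, dir: &Path, out: &mut Vec<PathBuf>) {
    match node {
        Node::Dir { name, children } => {
            let dir = dir.join(name);
            for child in children {
                collect_paths(child, &dir, out);
            }
        }
        Node::File { name } => out.push(dir.join(name)),
    }
}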



@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
oxen-rust/src/lib/src/core/v_latest/fetch.rs (3)

320-332: Duplicate tree traversal: list_all_files is called twice per subtree.

list_all_files is invoked in collect_missing_entries_for_subtree (line 374) and again here (line 329). Consider returning the collected hashes from collect_missing_entries_for_subtree to avoid the redundant traversal.

🔎 Suggested approach

Modify collect_missing_entries_for_subtree to return or populate the new file hashes it encounters:

fn collect_missing_entries_for_subtree(
    tree: &MerkleTreeNode,
    subtree_path: &PathBuf,
    missing_entries: &mut HashSet<Entry>,
    total_bytes: &mut u64,
    shared_hashes: &HashSet<MerkleHash>,
    new_hashes: &mut HashSet<MerkleHash>,  // collect newly seen hashes
) -> Result<(), OxenError>

Then the caller can extend shared_file_hashes with new_hashes without re-traversing the tree.
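
A caller under that signature might then merge once per subtree (sketch only; tree, subtree_path, missing_entries, total_bytes, and shared_file_hashes are assumed bindings from the surrounding function):

let mut new_hashes: HashSet<MerkleHash> = HashSet::new();
collect_missing_entries_for_subtree(
    &tree,
    &subtree_path,
    &mut missing_entries,
    &mut total_bytes,
    &shared_file_hashes,
    &mut new_hashes,
)?;
// One cheap set merge instead of a second list_all_files traversal.
shared_file_hashes.extend(new_hashes);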


349-361: Same duplicate traversal issue applies here.

For consistency with the earlier suggestion, this branch should also be refactored to avoid calling list_all_files twice per commit.


232-281: Remove redundant .clone() on MerkleHash.

MerkleHash derives Copy, making .clone() on lines 257 and 276 unnecessary. Use dereference instead:

-            file_hashes.insert(file.file_node.hash().clone());
+            file_hashes.insert(*file.file_node.hash());

Apply to both locations.
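
For reference, a tiny sketch of why the dereference suffices (assuming MerkleHash derives Copy, as stated above; the struct here is a stand-in):

use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct MerkleHash(u128);

fn insert_hash(hash: &MerkleHash, set: &mut HashSet<MerkleHash>) {
    // Dereferencing a &Copy value copies it implicitly; no .clone() needed.
    set.insert(*hash);
}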

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a0de033 and 295c3d1.

📒 Files selected for processing (1)
  • oxen-rust/src/lib/src/core/v_latest/fetch.rs (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
oxen-rust/src/lib/src/core/v_latest/fetch.rs (2)
oxen-rust/src/lib/src/model/merkle_tree/merkle_hash.rs (1)
  • new (18-20)
oxen-rust/src/lib/src/repositories/tree.rs (2)
  • get_subtree_by_depth_with_unique_children (367-390)
  • list_all_files (702-709)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Test Suite / Test Suite (macos-latest)
  • GitHub Check: Test Suite / Test Suite (windows-latest)
  • GitHub Check: Test Suite / Test Suite (ubuntu-latest)
  • GitHub Check: Lint / Cargo check, format, clippy + Ruff
🔇 Additional comments (3)
oxen-rust/src/lib/src/core/v_latest/fetch.rs (3)

12-12: LGTM!

The MerkleHash import is correctly added and necessary for the new deduplication logic using HashSet<MerkleHash>.


292-298: LGTM!

Good approach to initialize the deduplication set from the HEAD commit. This ensures files already present locally are not re-downloaded.
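
Roughly, that seeding step looks like this (a sketch under assumed names; list_files_in_head is hypothetical, standing in for however the real code lists file nodes from the HEAD commit's tree):

// Seed the dedup set with hashes already present locally, so only
// genuinely missing content is added to missing_entries.
let mut file_hashes_seen: HashSet<MerkleHash> = HashSet::new();
for file_node in list_files_in_head(&repo, &head_commit)? {
    file_hashes_seen.insert(*file_node.hash());
}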


367-393: No action needed. The function collect_missing_entries_for_subtree is private to this module (not pub), so it has no external callers. Both internal call sites within the same file are already updated with the new shared_hashes parameter.

Likely an incorrect or invalid review comment.

@subygan subygan force-pushed the API-110/oxen-fetch-dedup branch from 295c3d1 to a81a507 on December 22, 2025 16:54
@subygan subygan requested a review from jcelliott December 22, 2025 18:02
@rpschoenburg rpschoenburg (Collaborator) left a comment

With the latest commit, I think this is good. The prior optimization (skipping loading parts of the merkle tree that have already been loaded into memory) is restored, and this still does the de-dup.

Considering that we need it in multiple places now, I'm thinking I'll go ahead and make a new tree-loading method that collects all the file nodes while we're reading the tree into memory, which would eliminate the need for the tree traversal in collect_missing_entries. But I think this is good for now.

@subygan subygan (Collaborator, Author) commented Dec 22, 2025

@rpschoenburg Exactly what I was thinking. We now have it in both pull and push, so it would be nice to have that.

@rpschoenburg rpschoenburg merged commit 45013d3 into main Dec 23, 2025
5 checks passed
@rpschoenburg rpschoenburg deleted the API-110/oxen-fetch-dedup branch December 23, 2025 08:38