
Conversation

@subygan subygan (Collaborator) commented Dec 19, 2025

Summary by CodeRabbit

  • Refactor
    • Fetch/deduplication now tracks file hashes across subtree traversal, reducing duplicate downloads and improving total-byte accuracy and fetch performance.
  • Chores
    • Archive creation now uses a faster compression setting, producing archives more quickly at the cost of a lower compression ratio.


@coderabbitai coderabbitai bot (Contributor) commented Dec 19, 2025

Walkthrough

Reworks missing-entry collection in fetch.rs: adds MerkleHash to the public imports and signatures, introduces a shared file_hashes_seen: HashSet<MerkleHash> for deduplication, removes the per-subtree path parameter, and switches traversal to node-based Merkle tree walking with CommitEntry::from_node.

Changes

Cohort / File(s) | Summary

  • Fetch dedupe & traversal (oxen-rust/src/lib/src/core/v_latest/fetch.rs): Added MerkleHash to imports and public signatures; replaced the subtree_path parameter with a shared file_hashes_seen: &mut HashSet<MerkleHash>; changed subtree processing to walk Merkle nodes and build CommitEntry values via CommitEntry::from_node; dedupes by Merkle hash and updates missing_entries/total_bytes only for unseen hashes.
  • Tree compression change (oxen-rust/src/lib/src/repositories/tree.rs): Switched the gzip compression setting to Compression::fast() in compress_tree, compress_nodes, compress_node, and compress_commits, trading compression ratio for encoder speed while keeping APIs and control flow unchanged.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Verify that all call sites of collect_missing_entries_for_subtree are updated to the new signature and lifetimes.
  • Confirm deduplication correctness: file_hashes_seen is updated and prevents duplicate entries across subtrees.
  • Inspect CommitEntry::from_node usage and MerkleTreeNodeType handling (File vs FileChunk).
  • Check archive outputs for intended effects of Compression::fast() where compression ratio/throughput matters.

Suggested reviewers

  • rpschoenburg

Poem

"I hopped through nodes, a nimble lab,
Collected hashes from each slab,
One set to banish twin-file frights,
I stitch the tree through days and nights,
— your rabbit reviewer, hat and map 🐇"

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 66.67%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title 'Deduplication during pull' directly reflects the main changes: deduplication logic (file_hashes_seen tracking) in the fetch/pull path and the compression performance change.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a81a507 and 9663243.

📒 Files selected for processing (2)
  • oxen-rust/src/lib/src/core/v_latest/fetch.rs
  • oxen-rust/src/lib/src/repositories/tree.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Lint / Cargo check, format, clippy + Ruff
  • GitHub Check: Test Suite / Test Suite (windows-latest)
  • GitHub Check: Test Suite / Test Suite (macos-latest)
  • GitHub Check: Test Suite / Test Suite (ubuntu-latest)
🔇 Additional comments (6)
oxen-rust/src/lib/src/repositories/tree.rs (2)

59-59: LGTM: Documentation improvements.

The doc comment clarifications better describe the hash collection behavior.

Also applies to: 81-81


889-889: Verify the compression speed vs. size trade-off.

The switch from Compression::default() to Compression::fast() prioritizes compression speed over compression ratio. This will produce larger archives but compress faster.

Ensure this trade-off aligns with your performance goals, especially considering:

  • Network transfer time for larger archives
  • Storage requirements
  • Overall end-to-end fetch/pull performance

Also applies to: 930-930, 964-964, 990-990
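
A minimal sketch of what this change amounts to, assuming tree.rs uses the flate2 crate's gzip encoder (the function below is illustrative, not the actual compress_tree code):

use flate2::{write::GzEncoder, Compression};
use std::io::Write;

fn gzip_fast(data: &[u8]) -> std::io::Result<Vec<u8>> {
    // Compression::fast() (level 1) minimizes CPU time at the cost of
    // larger output; Compression::default() (level 6) is the reverse.
    let mut encoder = GzEncoder::new(Vec::new(), Compression::fast());
    encoder.write_all(data)?;
    encoder.finish()
}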

oxen-rust/src/lib/src/core/v_latest/fetch.rs (4)

11-12: LGTM: Import additions support new deduplication logic.

The added imports (EMerkleTreeNode, MerkleTreeNode, MerkleHash) are correctly used in the refactored collect_missing_entries_for_subtree function.


256-256: LGTM: Cross-commit deduplication via shared hash set.

Introducing file_hashes_seen enables deduplication of file entries across multiple commits, ensuring each unique file hash is downloaded only once to the versions directory.
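
A hedged sketch of the pattern (the type shapes below are stand-ins; the real Entry and MerkleHash in fetch.rs differ):

use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct MerkleHash(u128); // hypothetical stand-in for the real type

#[derive(PartialEq, Eq, Hash)]
struct Entry {
    hash: MerkleHash,
    num_bytes: u64,
}

fn record_if_unseen(
    entry: Entry,
    file_hashes_seen: &mut HashSet<MerkleHash>,
    missing_entries: &mut HashSet<Entry>,
    total_bytes: &mut u64,
) {
    // HashSet::insert returns true only for hashes not seen before, so a
    // file reachable from multiple commits/subtrees is counted exactly once.
    if file_hashes_seen.insert(entry.hash) {
        *total_bytes += entry.num_bytes;
        missing_entries.insert(entry);
    }
}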


281-286: LGTM: Function signature updated for deduplication.

The signature change removes the unused subtree_path parameter and adds file_hashes_seen to enable cross-commit file deduplication. Call sites are correctly updated.

Also applies to: 306-311, 317-321


323-338: Refactored code creates CommitEntry instances with incomplete path information.

The refactored collect_missing_entries_for_subtree uses tree.walk_tree() which does not track directory context. Consequently, CommitEntry::from_node(&node.node) at line 331 creates entries with only the filename in the path field (via PathBuf::from(file_node.name())), not the full directory path.

Unlike the recursive r_download_entries pattern elsewhere in the same file, which explicitly tracks directory context through recursion and constructs complete paths via directory.join(file_node.name()), this refactored approach cannot reconstruct full paths. While version storage is hash-based and the fetch operation functions correctly, the incomplete path field violates semantic correctness and could cause issues if the path is used in logging, debugging, or downstream operations beyond the immediate fetch context.

Consider tracking directory context during tree traversal or restructuring the callback to pass directory information.
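
A minimal sketch of that fix, carrying the directory down a recursive walk (the node shape is illustrative, not the actual MerkleTreeNode API):

use std::path::{Path, PathBuf};

enum Node {
    Dir { name: String, children: Vec<Node> },
    File { name: String },
}

// Accumulate the directory during recursion so each file entry gets its
// full repo-relative path, not just its file name.
fn collect_paths(node: &Node, dir: &Path, out: &mut Vec<PathBuf>) {
    match node {
        Node::Dir { name, children } => {
            let dir = dir.join(name);
            for child in children {
                collect_paths(child, &dir, out);
            }
        }
        Node::File { name } => out.push(dir.join(name)),
    }
}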



@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
oxen-rust/src/lib/src/core/v_latest/fetch.rs (3)

320-332: Duplicate tree traversal: list_all_files is called twice per subtree.

list_all_files is invoked in collect_missing_entries_for_subtree (line 374) and again here (line 329). Consider returning the collected hashes from collect_missing_entries_for_subtree to avoid the redundant traversal.

🔎 Suggested approach

Modify collect_missing_entries_for_subtree to return or populate the new file hashes it encounters:

fn collect_missing_entries_for_subtree(
    tree: &MerkleTreeNode,
    subtree_path: &PathBuf,
    missing_entries: &mut HashSet<Entry>,
    total_bytes: &mut u64,
    shared_hashes: &HashSet<MerkleHash>,
    new_hashes: &mut HashSet<MerkleHash>,  // collect newly seen hashes
) -> Result<(), OxenError>

Then the caller can extend shared_file_hashes with new_hashes without re-traversing the tree.
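
A caller under that signature might then merge once per subtree (sketch only; tree, subtree_path, missing_entries, total_bytes, and shared_file_hashes are assumed bindings from the surrounding function):

let mut new_hashes: HashSet<MerkleHash> = HashSet::new();
collect_missing_entries_for_subtree(
    &tree,
    &subtree_path,
    &mut missing_entries,
    &mut total_bytes,
    &shared_file_hashes,
    &mut new_hashes,
)?;
// One cheap set merge instead of a second list_all_files traversal.
shared_file_hashes.extend(new_hashes);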


349-361: Same duplicate traversal issue applies here.

For consistency with the earlier suggestion, this branch should also be refactored to avoid calling list_all_files twice per commit.


232-281: Remove redundant .clone() on MerkleHash.

MerkleHash derives Copy, making .clone() on lines 257 and 276 unnecessary. Use dereference instead:

-            file_hashes.insert(file.file_node.hash().clone());
+            file_hashes.insert(*file.file_node.hash());

Apply to both locations.
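
For reference, a tiny sketch of why the dereference suffices (assuming MerkleHash derives Copy, as stated above; the struct here is a stand-in):

use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct MerkleHash(u128);

fn insert_hash(hash: &MerkleHash, set: &mut HashSet<MerkleHash>) {
    // Dereferencing a &Copy value copies it implicitly; no .clone() needed.
    set.insert(*hash);
}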

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a0de033 and 295c3d1.

📒 Files selected for processing (1)
  • oxen-rust/src/lib/src/core/v_latest/fetch.rs (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
oxen-rust/src/lib/src/core/v_latest/fetch.rs (2)
oxen-rust/src/lib/src/model/merkle_tree/merkle_hash.rs (1)
  • new (18-20)
oxen-rust/src/lib/src/repositories/tree.rs (2)
  • get_subtree_by_depth_with_unique_children (367-390)
  • list_all_files (702-709)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Test Suite / Test Suite (macos-latest)
  • GitHub Check: Test Suite / Test Suite (windows-latest)
  • GitHub Check: Test Suite / Test Suite (ubuntu-latest)
  • GitHub Check: Lint / Cargo check, format, clippy + Ruff
🔇 Additional comments (3)
oxen-rust/src/lib/src/core/v_latest/fetch.rs (3)

12-12: LGTM!

The MerkleHash import is correctly added and necessary for the new deduplication logic using HashSet<MerkleHash>.


292-298: LGTM!

Good approach to initialize the deduplication set from the HEAD commit. This ensures files already present locally are not re-downloaded.
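
Roughly, that seeding step looks like this (a sketch under assumed names; list_files_in_head is hypothetical, standing in for however the real code lists file nodes from the HEAD commit's tree):

// Seed the dedup set with hashes already present locally, so only
// genuinely missing content is added to missing_entries.
let mut file_hashes_seen: HashSet<MerkleHash> = HashSet::new();
for file_node in list_files_in_head(&repo, &head_commit)? {
    file_hashes_seen.insert(*file_node.hash());
}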


367-393: No action needed. The function collect_missing_entries_for_subtree is private to this module (not pub), so it has no external callers. Both internal call sites within the same file are already updated with the new shared_hashes parameter.

Likely an incorrect or invalid review comment.

@subygan subygan force-pushed the API-110/oxen-fetch-dedup branch from 295c3d1 to a81a507 on December 22, 2025 16:54
@subygan subygan requested a review from jcelliott December 22, 2025 18:02
@rpschoenburg rpschoenburg (Collaborator) left a comment

With the latest commit, I think this is good. The prior optimization (skipping loading parts of the merkle tree that have already been loaded into memory) is restored, and this still does the de-dup.

Considering that we need it in multiple places now, I'm thinking I'll go ahead and make a new tree-loading method that collects all the file nodes while we're reading the tree into memory, which would eliminate the need for the tree traversal in collect_missing_entries. But I think this is good for now.

@subygan subygan (Collaborator, Author) commented Dec 22, 2025

@rpschoenburg Exactly what I was thinking. We now have it in both pull and push, so it would be nice to have that.

@rpschoenburg rpschoenburg merged commit 45013d3 into main Dec 23, 2025
5 checks passed
@rpschoenburg rpschoenburg deleted the API-110/oxen-fetch-dedup branch December 23, 2025 08:38