Conversation

@ranger-ross
Member

@ranger-ross ranger-ross commented Oct 26, 2025

This PR adds fine grain locking for the build cache using build unit level locking.
I'd recommend reading the design details in this description and then reviewing commit by commit.
Part of #4282

Previous attempt: #16089

Design decisions / rationale

  • Using build unit level locking instead of a temporary working directory.
    • After experimenting with multiple approaches, I am currently leaning towards build unit level locking.
    • The working directory approach introduces a fair bit of uplifting complexity, and the further along I pushed my prototype, the more I ran into unexpected issues:
      • mtime changes in fingerprints due to uplifting/downlifting order
      • tests/benches need to be run before being uplifted OR uplifted and locked during execution, which requires more locking design. (Also, running pre-uplift introduces other potential side effects, like the path displayed to the user being deleted because it's temporary.)
    • The trade-off here is that with build unit level locks, we need a more advanced locking mechanism and we will have more open locks at once.
    • The reason I think this is a worthwhile trade-off is that the locking complexity can largely be contained to a single module, whereas the uplifting complexity would be spread throughout the cargo codebase anywhere we do uplifting. The increased lock count, while unavoidable, can be mitigated (see below for more details).
  • Risk of too many locks (file descriptors)
    • On Linux, 1024 is a fairly common default soft limit. Windows is even lower at 256.
    • Having 2 locks per build unit makes it possible to hit this limit with a moderate number of dependencies.
    • There are a few mitigations I could think of for this problem (that are included in this PR):
      • Increasing the file descriptor limit based on the number of build units (if the hard limit is high enough)
      • Sharing file descriptors for shared locks across jobs (within a single process) using a virtual lock
        • This could be implemented using reference counting (see the sketch after this list).
      • Falling back to coarse grain locking if some heuristic is not met
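
A minimal sketch of the reference-counting idea behind the virtual-lock mitigation above (simplified, hypothetical names; the PR's actual types are RcFileLock and FileLockInterner): shared locks on the same path reuse a single open descriptor within the process, and descriptors stay alive until process exit.

```rust
use std::collections::HashMap;
use std::fs::{File, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex};

/// One OS-level shared lock (and one fd) per path, shared by every job in
/// this process that requests the same lock.
#[derive(Default)]
struct SharedLockTable {
    locks: Mutex<HashMap<PathBuf, Arc<File>>>,
}

impl SharedLockTable {
    fn lock_shared(&self, path: &Path) -> io::Result<Arc<File>> {
        let mut locks = self.locks.lock().unwrap();
        if let Some(existing) = locks.get(path) {
            // Reuse the already-open descriptor instead of opening another.
            return Ok(Arc::clone(existing));
        }
        let file = OpenOptions::new().create(true).write(true).open(path)?;
        file.lock_shared()?; // take the OS shared lock once per process
        let file = Arc::new(file);
        locks.insert(path.to_path_buf(), Arc::clone(&file));
        Ok(file)
    }
}
```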

Implementation details

  • We have a stateful lock per build unit, made up of two file locks, primary.lock and secondary.lock (see the locking.rs module docs for more details on the states)
    • This is needed to enable pipelined builds
  • We fall back to coarse grain locking if fine grain locking is determined to be unsafe (see determine_locking_mode())
  • Fine grain locking continues to take the existing .cargo-lock lock as read-only shared, to keep working with older cargo versions while allowing multiple newer cargo instances to run in parallel.
  • Locking is disabled on network filesystems. (keeping existing behavior from Don't use flock on NFS mounts #2623)
  • cargo clean continues to use coarse grain locking for simplicity.
  • File descriptors
    • I added functionality to increase the file descriptor limit if cargo detects that there will not be enough descriptors based on the number of build units in the UnitGraph (see the sketch after this list).
    • If we aren’t able to increase the limit to a threshold (currently number of build units * 10), we automatically fall back to coarse grain locking and display a warning to the user.
      • I picked 10 times the number of build units as a conservative estimate for now. I think lowering this number may be reasonable.
    • While testing, I was seeing a peak of ~3,200 open file descriptors while compiling Zed. This is approximately 2x the number of build units.
      • Without the RcFileLock I was seeing peaks of ~12,000 open fds, which I felt was quite high even for a large project like Zed.
  • We use a global FileLockInterner that holds on to the file descriptors (RcFileLock) until the end of the process. (We could potentially add it to JobState if preferred; it would just be a bit more plumbing.)
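
A rough sketch of the limit-raising fallback described above (Unix-only; names and error handling are illustrative, and the 10x threshold is the one mentioned in the list):

```rust
use std::io;

/// Raise RLIMIT_NOFILE toward the hard limit when the estimated need
/// exceeds the current soft limit. Returns Ok(false) when the caller
/// should fall back to coarse grain locking and warn the user.
fn ensure_fd_headroom(estimated_need: libc::rlim_t) -> io::Result<bool> {
    let mut lim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
    if unsafe { libc::getrlimit(libc::RLIMIT_NOFILE, &mut lim) } != 0 {
        return Err(io::Error::last_os_error());
    }
    if lim.rlim_cur >= estimated_need {
        return Ok(true); // enough headroom already
    }
    if lim.rlim_max >= estimated_need {
        // The soft limit can be raised up to the hard limit without privileges.
        lim.rlim_cur = estimated_need;
        if unsafe { libc::setrlimit(libc::RLIMIT_NOFILE, &lim) } != 0 {
            return Err(io::Error::last_os_error());
        }
        return Ok(true);
    }
    Ok(false)
}
```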

Open Questions

  • Losing the Blocking message (Implement fine grain locking for build-dir #16155 (comment))
  • Lock downgrading scheme relies on unspecified behavior, see Implement fine grain locking for build-dir #16155 (comment)
  • How do we want to handle locking on the artifact directory?
    • We could simply continue using coarse grain locking, locking and unlocking when files are uplifted.
    • One downside of locking/unlocking multiple times per invocation is that the artifact-dir is touched many times across the compilation process (for example, there is a pre-rustc clean-up step). We also need to take into account other commands like cargo doc.
    • Another option would be to only take a lock on the artifact-dir for commands that we know will uplift files (e.g. cargo check would not take a lock on the artifact-dir but cargo build would). This would mean that 2 cargo build invocations would not run in parallel, because one of them would hold the artifact-dir lock (blocking the other). This might actually be ideal to avoid 2 instances fighting over the CPU while recompiling the same crates.
    • Solved by Do not lock the artifact-dir for check builds #16230
  • What should our testing strategy for locking be?
    • My testing strategy thus far has been to run cargo on dummy projects to verify the locking.
    • For the max file descriptor testing, I have been using the Zed codebase as a testbed, as it has over 1,500 build units, which is more than the default ulimit on my Linux system. (I am happy to test this on other large codebases that we think would be good to verify against.)
    • It’s not immediately obvious to me how to create repeatable unit tests for this, or what those tests should be testing for.
    • For performance testing, I have been using hyperfine to benchmark builds with and without -Zbuild-dir-new-layout. With the current implementation I am not seeing any perf regression on Linux, but I have yet to test on Windows/macOS.
  • Should we expose an option in the future to allow users to force coarse grain locking?
    • In the event a user’s system can’t support fine grain locking for some reason, should we provide some way to control the locking mode like CARGO_BUILD_LOCKING_MODE=coarse?
    • If so, would this be something that would be temporary during the transition period or a permanent feature?
  • What should the heuristics to disable fine grain locking be?
    • Currently, we fall back if the max file descriptor limit is less than 10 times the number of build units. This is pretty conservative; from my testing, usage generally peaks at around 2 times the number of build units.
    • I wonder if there is any other information in the crate graph that we could use as a heuristic?

@rustbot
Collaborator

rustbot commented Oct 26, 2025

r? @ehuss

rustbot has assigned @ehuss.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added the A-build-execution, A-filesystem, A-layout, Command-clean, and S-waiting-on-review labels on Oct 26, 2025
Comment on lines +53 to +55
/// Coarse grain locking (Profile level)
Coarse,
Contributor

mtime changes in fingerprints due to uplifting/downlifting order

This will also be a concern for #5931

For non-local packages, we don't check mtimes. Unsure what we do for their build script runs.

Member Author

I haven't quite thought through how we will handle mtime for the artifact cache.

Checksum freshness (#14136) would surely make this easier, as mtimes are painful to deal with.
Might be worth exploring pushing that forward before starting on the artifact cache...

Contributor

Note that mtimes are still used for build scripts

Comment on lines +53 to +55
/// Coarse grain locking (Profile level)
Coarse,
Contributor

tests/benches need to be run before being uplifted OR uplifted and locked during execution, which requires more locking design. (Also, running pre-uplift introduces other potential side effects, like the path displayed to the user being deleted because it's temporary.)

Do we hold any locks during test execution today? I'm not aware of any.

Member Author

hmmm, I was under the assumption that the lock in Layout was held while running the tests, but I never explicitly tested that. Upon further inspection, we do indeed release the lock before executing the tests.

Comment on lines 133 to 143
let primary_lock = open_file(&self.primary)?;
primary_lock.lock()?;

let secondary_lock = open_file(&self.secondary)?;
secondary_lock.lock()?;

self.guard = Some(UnitLockGuard {
primary: primary_lock,
_secondary: Some(secondary_lock),
});
Ok(())
Contributor

Have we double checked if we run into problems like #15698?

Member Author

I took a closer look and it appears that there is a possibility that a similar issue could happen.
I think in practice it would not deadlock, since failing to take a lock would result in the build failing, so the lock would be released when the process exits.

But regardless, I went ahead and added logic to unlock the partial lock if we fail to take the full lock, just in case.
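
A self-contained sketch of that fix (std file locks and hypothetical names; the real code uses the module's open_file helper and CargoResult):

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Take both unit locks, releasing the first if the second fails, so a
/// failed acquisition doesn't leave a partial lock held until process exit.
fn lock_both(primary: &Path, secondary: &Path) -> io::Result<(File, File)> {
    let primary_lock = File::options().create(true).write(true).open(primary)?;
    primary_lock.lock()?;

    let take_secondary = || -> io::Result<File> {
        let f = File::options().create(true).write(true).open(secondary)?;
        f.lock()?;
        Ok(f)
    };
    match take_secondary() {
        Ok(secondary_lock) => Ok((primary_lock, secondary_lock)),
        Err(e) => {
            let _ = primary_lock.unlock(); // best-effort release of the partial lock
            Err(e)
        }
    }
}
```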

Comment on lines 167 to 205
pub fn downgrade(&mut self) -> CargoResult<()> {
let guard = self
.guard
.as_ref()
.context("guard was None while calling downgrade")?;

// NOTE:
// > Subsequent flock() calls on an already locked file will convert an existing lock to the new lock mode.
// https://man7.org/linux/man-pages/man2/flock.2.html
//
// However, the `std::file::File::lock/lock_shared` is allowed to change this in the
// future. So it's probably up to us if we are okay with using this or if we want to use a
// different interface to flock.
guard.primary.lock_shared()?;

Ok(())
}
}
Contributor

We should rely on advertised behavior, especially as I'm assuming not all platforms are backed by flock (e.g. Windows).

Member Author

@ranger-ross ranger-ross Oct 28, 2025

We could probably move away from the std interface, or perhaps open an issue to see if t-libs would be willing to clarify the behavior of calling lock_shared() while holding an exclusive lock.

Side note: I came across this crate which advertises the behavior we need. It's fairly small (and MIT licensed), so we could potentially reuse part of this code for cargo's use case (or use it directly, though I don't know cargo's policy on taking dependencies on third-party crates).
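
For reference, the flock(2) behavior we would be relying on can be exercised directly via libc, sidestepping the unspecified std semantics. A Unix-only sketch (hypothetical helper, not what this PR currently does):

```rust
use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

/// Downgrade an exclusive lock to shared in place. flock(2) documents that
/// "subsequent flock() calls on an already locked file will convert an
/// existing lock to the new lock mode."
fn downgrade_to_shared(file: &File) -> io::Result<()> {
    if unsafe { libc::flock(file.as_raw_fd(), libc::LOCK_SH) } != 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(())
}
```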

Any concerns with keeping the std lock for now and having an action item for this prior to stabilization?

Contributor

We can move this to an Unresolved issue.

@ranger-ross ranger-ross force-pushed the multi-locking-attempt branch from b234db5 to 1deca86 on October 28, 2025 07:17

@ranger-ross ranger-ross changed the title from "Implement fine grain locking for the target directory" to "Implement fine grain locking for build-dir" on Nov 22, 2025
@ranger-ross ranger-ross force-pushed the multi-locking-attempt branch from 4daae0d to e0f905c on November 22, 2025 14:16
@ranger-ross
Member Author

Finally coming back to this PR now that we have merged #16230.
I rebased to resolve the conflicts and responded to some of the open threads.


@ranger-ross ranger-ross force-pushed the multi-locking-attempt branch from e0f905c to 5b36e7c on November 24, 2025 08:33

@epage
Contributor

epage commented Nov 24, 2025

In general, something I realized we need to watch out for is the exact circumstances of why we don't always use -Cextra-filename, whether it is about file names or the entire path. This means moving built content could cause problems:

// No metadata in these cases:
//
// - dylibs:
// - if any dylib names are encoded in executables, so they can't be renamed.
// - TODO: Maybe use `-install-name` on macOS or `-soname` on other UNIX systems
// to specify the dylib name to be used by the linker instead of the filename.
// - Windows MSVC executables: The path to the PDB is embedded in the
// executable, and we don't want the PDB path to include the hash in it.
// - wasm32-unknown-emscripten executables: When using emscripten, the path to the
// .wasm file is embedded in the .js file, so we don't want the hash in there.


@ranger-ross ranger-ross force-pushed the multi-locking-attempt branch from 5b36e7c to 083d4b3 on December 3, 2025 14:11

@ranger-ross ranger-ross force-pushed the multi-locking-attempt branch from 083d4b3 to ec9f6bc on December 4, 2025 13:26
@rustbot
Collaborator

rustbot commented Dec 4, 2025

This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

This commit adds handling to attempt to increase the max file
descriptors if we detect that there is a risk of hitting the limit.

If we cannot increase the max file descriptors, we fall back to coarse
grain locking as that is better than a build crashing due to resource
limits.
@ranger-ross ranger-ross force-pushed the multi-locking-attempt branch from ec9f6bc to 8252855 on December 4, 2025 13:28
@ranger-ross
Member Author

@epage I re-reviewed the changes in this PR and I believe they accomplish the first step in the plan laid out in #t-cargo > Build cache and locking design @ 💬.

So I think this PR is now good to be reviewed.

@rustbot
Collaborator

rustbot commented Dec 4, 2025

☔ The latest upstream changes (possibly 1d3fb78) made this pull request unmergeable. Please resolve the merge conflicts.

Comment on lines +366 to +368
// TODO: We should probably revalidate the fingerprint here as another Cargo instance could
// have already compiled the crate before we recv'd the lock.
// For large crates re-compiling here would be quite costly.
Contributor

Let's move this TODO out of the code to a place we can track

Contributor

@epage epage Dec 4, 2025

We should probably revalidate the fingerprint here as another Cargo instance could

Or we move the lock acquisition up a level to be around the fingerprinting (since that has us read the build unit) and then we move it into the job

Member Author

hmmm good point. I overlooked locking during the fingerprint read 😓
Even if a unit is fresh, we still need to take a lock while evaluating the fingerprint.

I suppose this shouldn't be too problematic, as if the unit is fresh we'd immediately unlock it, so lock contention would be low.
I think that change should not be too difficult.

Contributor

#16155 (comment) ties into this and can also have a major effect on the design

Comment on lines +1004 to +1007
let mut lock = if build_runner.bcx.gctx.cli_unstable().fine_grain_locking
&& matches!(build_runner.locking_mode, LockingMode::Fine)
{
Some(CompilationLock::new(build_runner, unit))
Contributor

Can we have a build_runner.unit_lock(unit)?

target: Option<CompileTarget>,
dest: &str,
must_take_artifact_dir_lock: bool,
build_dir_locking_mode: &LockingMode,
Contributor

Why do we take a reference to something that could be made Copy?

Member Author

no reason, I'll make it Copy :D

Ok(())
}

pub fn downgrade(&mut self) -> CargoResult<()> {
Contributor

Can we clarify what this means? Maybe downgrade_partial?

Comment on lines 138 to 173
pub fn lock_exclusive(&mut self) -> CargoResult<()> {
assert!(self.guard.is_none());

let partial = open_file(&self.partial)?;
partial.lock()?;

let full = open_file(&self.full)?;
full.lock()?;

self.guard = Some(UnitLockGuard {
partial,
_full: Some(full),
});
Ok(())
}

pub fn lock_shared(&mut self, ty: &SharedLockType) -> CargoResult<()> {
assert!(self.guard.is_none());

let partial = open_file(&self.partial)?;
partial.lock_shared()?;

let full = if matches!(ty, SharedLockType::Full) {
let full_lock = open_file(&self.full)?;
full_lock.lock_shared()?;
Some(full_lock)
} else {
None
};

self.guard = Some(UnitLockGuard {
partial,
_full: full,
});
Ok(())
}
Contributor

Unlike Filesystem locking, this doesn't provide a way to find out what you are blocked on, or that we are even blocked.

I doubt we can send a Blocking message to the user within our current progress system, though that is something I want to eventually redesign.

Can we at least log a message saying what is being blocked?

Member Author

sure, I can add logging (see the sketch below this list).

I also looked into providing feedback, but:

  1. It was a bit tricky, as gctx is not available in the Work units that are executed by the job queue, which I believe is the primary interface to shell output.
  2. I wasn't quite sure what the best way to present this info to the user would be. I was worried about potentially flooding the screen with messages as units are unlocked and new units get blocked.
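
A minimal sketch of the logging (hypothetical helper; assumes std's try_lock/TryLockError from Rust 1.89 and the tracing crate cargo already uses for internal logging):

```rust
use std::fs::{File, TryLockError};
use std::io;
use std::path::Path;

/// Take an exclusive lock, logging only when we would actually block.
fn lock_exclusive_logged(file: &File, lock_path: &Path) -> io::Result<()> {
    match file.try_lock() {
        Ok(()) => return Ok(()),
        Err(TryLockError::WouldBlock) => {
            tracing::debug!("blocked waiting for lock on {}", lock_path.display());
        }
        Err(TryLockError::Error(e)) => return Err(e),
    }
    file.lock()?; // blocking acquire
    tracing::debug!("acquired lock on {}", lock_path.display());
    Ok(())
}
```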

/// This lock is designed to reduce file descriptors by sharing a single file descriptor for a
/// given lock when the lock is shared. The motivation for this is to avoid hitting file descriptor
/// limits when fine grain locking is enabled.
pub struct RcFileLock {
Contributor

This is a lot of complexity when we don't need to lock within our own process.

What if we tracked all locks inside of the BuildRunner? We could have a single lock per build unit that we grab exclusively as soon as we know the dep unit path, store them in a HashMap<PathBuf, FileLock> (either using Filesystem or adding a public constructor for FileLock), and hold onto them until the end of the build.

At least for a first step, it simplifies things a lot. It does mean that another build will block until this one is done if they share some build units but not all. That won't be the case for cargo check vs cargo build or for cargo check vs cargo clippy. It will be an issue for cargo check vs cargo check --no-default-features or cargo check vs cargo check --workspace. We can at least defer that out of this initial PR and evaluate the different multi-lock designs under this scheme and how much of a need there is for that.
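
Roughly, that suggestion would look like this (a sketch with hypothetical names, using std file locks rather than cargo's FileLock):

```rust
use std::collections::HashMap;
use std::fs::{File, OpenOptions};
use std::io;
use std::path::PathBuf;

/// Per-build lock table held by the BuildRunner: one exclusive lock per
/// build unit, acquired up front and dropped at the end of the build.
#[derive(Default)]
struct UnitLocks {
    held: HashMap<PathBuf, File>,
}

impl UnitLocks {
    fn acquire(&mut self, lock_path: PathBuf) -> io::Result<()> {
        if self.held.contains_key(&lock_path) {
            return Ok(()); // already held by this build
        }
        let file = OpenOptions::new().create(true).write(true).open(&lock_path)?;
        file.lock()?; // blocks until any other cargo process releases it
        self.held.insert(lock_path, file);
        Ok(())
    }
}
```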

Member Author

What if we tracked all locks inside of the BuildRunner?

We could try something like this, but the .rmeta_produce() call for pipelined builds makes this tricky, since build_runner is not in scope for the Work closure (similar issue as this comment).

Though it might be possible to plumb that over to the JobState.
I'm unsure how difficult that would be, but I can look into it.

Contributor

@epage epage Dec 5, 2025

I was suggesting we simplify things down to just one lock per build unit. We grab it when doing the fingerprint check and then hold onto it until the end. We don't need these locks for coordination within our own build; this is just for cross-process coordination. If you have two processes doing cargo check && cargo check, the second one will effectively be blocked on the first anyways. Having finer grained locks than this only helps when some leaf crates can be shared but nothing else, like what happens when different features are activated. This seems minor enough, especially when we get the cross-project build cache, which is where these are more likely to live and will have a different locking scheme.

