Conversation

@ranger-ross
Member

@ranger-ross ranger-ross commented Oct 26, 2025

This PR adds fine grain locking for the build cache using build unit level locking.
I'd recommend reading the design details in this description and then reviewing commit by commit.
Part of #4282

Previous attempt: #16089

Design decisions / rationale

  • Using build unit level locking instead of a temporary working directory.
    • After experimenting with multiple approaches, I am currently leaning towards build unit level locking.
    • The working directory approach introduces a fair bit of uplifting complexity, and the further along I pushed my prototype, the more I ran into unexpected issues:
      • mtime changes in fingerprints due to uplifting/downlifting order
      • tests/benches need to be run before being uplifted OR uplifted and locked during execution, which requires more locking design. (Also, running pre-uplift introduces other potential side effects, like the path displayed to the user being deleted because it's temporary.)
    • The trade-off here is that with build unit level locks, we need a more advanced locking mechanism and we will have more open locks at once.
    • The reason I think this is a worthwhile trade-off is that the locking complexity can largely be contained to a single module, whereas the uplifting complexity would be spread throughout the cargo codebase anywhere we do uplifting. The increased lock count, while unavoidable, can be mitigated (see below for more details).
  • Risk of too many locks (file descriptors)
    • On Linux, 1024 is a fairly common default soft limit. Windows is even lower at 256.
    • Having 2 locks per build unit makes it possible to hit this limit with a moderate number of dependencies.
    • There are a few mitigations I could think of for this problem (that are included in this PR):
      • Increasing the file descriptor limit based on the number of build units (if the hard limit is high enough)
      • Sharing file descriptors for shared locks across jobs (within a single process) using a virtual lock
        • This could be implemented using reference counting (see the sketch after this list).
      • Falling back to coarse grain locking if some heuristic is not met
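
A minimal sketch of the reference-counting idea behind the virtual-lock mitigation above (simplified, hypothetical names; the PR's actual types are RcFileLock and FileLockInterner): shared locks on the same path reuse a single open descriptor within the process, and descriptors stay alive until process exit.

```rust
use std::collections::HashMap;
use std::fs::{File, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex};

/// One OS-level shared lock (and one fd) per path, shared by every job in
/// this process that requests the same lock.
#[derive(Default)]
struct SharedLockTable {
    locks: Mutex<HashMap<PathBuf, Arc<File>>>,
}

impl SharedLockTable {
    fn lock_shared(&self, path: &Path) -> io::Result<Arc<File>> {
        let mut locks = self.locks.lock().unwrap();
        if let Some(existing) = locks.get(path) {
            // Reuse the already-open descriptor instead of opening another.
            return Ok(Arc::clone(existing));
        }
        let file = OpenOptions::new().create(true).write(true).open(path)?;
        file.lock_shared()?; // take the OS shared lock once per process
        let file = Arc::new(file);
        locks.insert(path.to_path_buf(), Arc::clone(&file));
        Ok(file)
    }
}
```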

Implementation details

  • We have a stateful lock per build unit, made up of two file locks, primary.lock and secondary.lock (see the locking.rs module docs for more details on the states)
    • This is needed to enable pipelined builds
  • We fall back to coarse grain locking if fine grain locking is determined to be unsafe (see determine_locking_mode())
  • Fine grain locking continues to take the existing .cargo-lock lock as read-only shared, to keep working with older cargo versions while allowing multiple newer cargo instances to run in parallel.
  • Locking is disabled on network filesystems. (keeping existing behavior from Don't use flock on NFS mounts #2623)
  • cargo clean continues to use coarse grain locking for simplicity.
  • File descriptors
    • I added functionality to increase the file descriptor limit if cargo detects that there will not be enough descriptors based on the number of build units in the UnitGraph (see the sketch after this list).
    • If we aren’t able to increase the limit to a threshold (currently number of build units * 10), we automatically fall back to coarse grain locking and display a warning to the user.
      • I picked 10 times the number of build units as a conservative estimate for now. I think lowering this number may be reasonable.
    • While testing, I was seeing a peak of ~3,200 open file descriptors while compiling Zed. This is approximately 2x the number of build units.
      • Without the RcFileLock I was seeing peaks of ~12,000 open fds, which I felt was quite high even for a large project like Zed.
  • We use a global FileLockInterner that holds on to the file descriptors (RcFileLock) until the end of the process. (We could potentially add it to JobState if preferred; it would just be a bit more plumbing.)
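
A rough sketch of the limit-raising fallback described above (Unix-only; names and error handling are illustrative, and the 10x threshold is the one mentioned in the list):

```rust
use std::io;

/// Raise RLIMIT_NOFILE toward the hard limit when the estimated need
/// exceeds the current soft limit. Returns Ok(false) when the caller
/// should fall back to coarse grain locking and warn the user.
fn ensure_fd_headroom(estimated_need: libc::rlim_t) -> io::Result<bool> {
    let mut lim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
    if unsafe { libc::getrlimit(libc::RLIMIT_NOFILE, &mut lim) } != 0 {
        return Err(io::Error::last_os_error());
    }
    if lim.rlim_cur >= estimated_need {
        return Ok(true); // enough headroom already
    }
    if lim.rlim_max >= estimated_need {
        // The soft limit can be raised up to the hard limit without privileges.
        lim.rlim_cur = estimated_need;
        if unsafe { libc::setrlimit(libc::RLIMIT_NOFILE, &lim) } != 0 {
            return Err(io::Error::last_os_error());
        }
        return Ok(true);
    }
    Ok(false)
}
```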

Open Questions

  • Losing the Blocking message (Implement fine grain locking for build-dir #16155 (comment))
  • Lock downgrading scheme relies on unspecified behavior, see Implement fine grain locking for build-dir #16155 (comment)
  • How do we want to handle locking on the artifact directory?
    • We could simply continue using coarse grain locking, locking and unlocking when files are uplifted.
    • One downside of locking/unlocking multiple times per invocation is that the artifact-dir is touched many times across the compilation process (for example, there is a pre-rustc clean-up step). We also need to take into account other commands like cargo doc.
    • Another option would be to only take a lock on the artifact-dir for commands that we know will uplift files (e.g. cargo check would not take a lock on the artifact-dir but cargo build would). This would mean that 2 cargo build invocations would not run in parallel, because one of them would hold the artifact-dir lock (blocking the other). This might actually be ideal to avoid 2 instances fighting over the CPU while recompiling the same crates.
    • Solved by Do not lock the artifact-dir for check builds #16230
  • What should our testing strategy for locking be?
    • My testing strategy thus far has been to run cargo on dummy projects to verify the locking.
    • For the max file descriptor testing, I have been using the Zed codebase as a testbed, as it has over 1,500 build units, which is more than the default ulimit on my Linux system. (I am happy to test this on other large codebases that we think would be good to verify against.)
    • It’s not immediately obvious to me how to create repeatable unit tests for this, or what those tests should be testing for.
    • For performance testing, I have been using hyperfine to benchmark builds with and without -Zbuild-dir-new-layout. With the current implementation I am not seeing any perf regression on Linux, but I have yet to test on Windows/macOS.
  • Should we expose an option in the future to allow users to force coarse grain locking?
    • In the event a user’s system can’t support fine grain locking for some reason, should we provide some way to control the locking mode like CARGO_BUILD_LOCKING_MODE=coarse?
    • If so, would this be something that would be temporary during the transition period or a permanent feature?
  • What should the heuristics to disable fine grain locking be?
    • Currently, we fall back if the max file descriptor limit is less than 10 times the number of build units. This is pretty conservative; from my testing, usage generally peaks at around 2 times the number of build units.
    • I wonder if there is any other information in the crate graph that we could use as a heuristic?

@rustbot
Collaborator

rustbot commented Oct 26, 2025

r? @ehuss

rustbot has assigned @ehuss.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added the A-build-execution, A-filesystem, A-layout, Command-clean, and S-waiting-on-review labels on Oct 26, 2025
Comment on lines +53 to +55
/// Coarse grain locking (Profile level)
Coarse,
Contributor

mtime changes in fingerprints due to uplifting/downlifting order

This will also be a concern for #5931

For non-local packages, we don't check mtimes. Unsure what we do for their build script runs.

Member Author

I haven't quite thought through how we will handle mtime for the artifact cache.

Checksum freshness (#14136) would surely make this easier, as mtimes are painful to deal with.
Might be worth exploring pushing that forward before starting on the artifact cache...

Contributor

Note that mtimes are still used for build scripts

Comment on lines +53 to +55
/// Coarse grain locking (Profile level)
Coarse,
Contributor

tests/benches need to be run before being uplifted OR uplifted and locked during execution, which requires more locking design. (Also, running pre-uplift introduces other potential side effects, like the path displayed to the user being deleted because it's temporary.)

Do we hold any locks during test execution today? I'm not aware of any.

Member Author

hmmm, I was under the assumption that the lock in Layout was held while running the tests, but I never explicitly tested that. Upon further inspection, we do indeed release the lock before executing the tests.

Comment on lines 133 to 143
let primary_lock = open_file(&self.primary)?;
primary_lock.lock()?;

let secondary_lock = open_file(&self.secondary)?;
secondary_lock.lock()?;

self.guard = Some(UnitLockGuard {
primary: primary_lock,
_secondary: Some(secondary_lock),
});
Ok(())
Contributor

Have we double checked if we run into problems like #15698?

Member Author

I took a closer look and it appears that there is a possibility that a similar issue could happen.
I think in practice it would not deadlock, since failing to take a lock would result in the build failing, so the lock would be released when the process exits.

But regardless, I went ahead and added logic to unlock the partial lock if we fail to take the full lock, just in case.
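
A self-contained sketch of that fix (std file locks and hypothetical names; the real code uses the module's open_file helper and CargoResult):

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Take both unit locks, releasing the first if the second fails, so a
/// failed acquisition doesn't leave a partial lock held until process exit.
fn lock_both(primary: &Path, secondary: &Path) -> io::Result<(File, File)> {
    let primary_lock = File::options().create(true).write(true).open(primary)?;
    primary_lock.lock()?;

    let take_secondary = || -> io::Result<File> {
        let f = File::options().create(true).write(true).open(secondary)?;
        f.lock()?;
        Ok(f)
    };
    match take_secondary() {
        Ok(secondary_lock) => Ok((primary_lock, secondary_lock)),
        Err(e) => {
            let _ = primary_lock.unlock(); // best-effort release of the partial lock
            Err(e)
        }
    }
}
```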

Comment on lines 167 to 205
pub fn downgrade(&mut self) -> CargoResult<()> {
let guard = self
.guard
.as_ref()
.context("guard was None while calling downgrade")?;

// NOTE:
// > Subsequent flock() calls on an already locked file will convert an existing lock to the new lock mode.
// https://man7.org/linux/man-pages/man2/flock.2.html
//
// However, the `std::file::File::lock/lock_shared` is allowed to change this in the
// future. So it's probably up to us if we are okay with using this or if we want to use a
// different interface to flock.
guard.primary.lock_shared()?;

Ok(())
}
}
Contributor

We should rely on advertised behavior, especially as I'm assuming not all platforms are backed by flock (e.g. Windows).

Member Author

@ranger-ross ranger-ross Oct 28, 2025

We could probably move away from the std interface, or perhaps open an issue to see if t-libs would be willing to clarify the behavior of calling lock_shared() while holding an exclusive lock.

Side note: I came across this crate which advertises the behavior we need. It's fairly small (and MIT licensed), so we could potentially reuse part of this code for cargo's use case (or use it directly, though I don't know cargo's policy on taking dependencies on third-party crates).
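
For reference, the flock(2) behavior we would be relying on can be exercised directly via libc, sidestepping the unspecified std semantics. A Unix-only sketch (hypothetical helper, not what this PR currently does):

```rust
use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

/// Downgrade an exclusive lock to shared in place. flock(2) documents that
/// "subsequent flock() calls on an already locked file will convert an
/// existing lock to the new lock mode."
fn downgrade_to_shared(file: &File) -> io::Result<()> {
    if unsafe { libc::flock(file.as_raw_fd(), libc::LOCK_SH) } != 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(())
}
```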

Any concerns with keeping the std lock for now and having an action item for this prior to stabilization?

Contributor

We can move this to an Unresolved issue.

@ranger-ross ranger-ross force-pushed the multi-locking-attempt branch from b234db5 to 1deca86 on October 28, 2025 07:17

@ranger-ross ranger-ross changed the title from "Implement fine grain locking for the target directory" to "Implement fine grain locking for build-dir" on Nov 22, 2025
@ranger-ross ranger-ross force-pushed the multi-locking-attempt branch from 4daae0d to e0f905c on November 22, 2025 14:16
@ranger-ross
Member Author

Finally coming back to this PR now that we have merged #16230.
I rebased to resolve the conflicts and responded to some of the open threads.


@ranger-ross ranger-ross force-pushed the multi-locking-attempt branch from e0f905c to 5b36e7c on November 24, 2025 08:33

@epage
Contributor

epage commented Nov 24, 2025

In general, something I realized we need to watch out for is the exact circumstances of why we don't always use -Cextra-filename, whether it is about file names or the entire path. This means moving built content could cause problems:

// No metadata in these cases:
//
// - dylibs:
// - if any dylib names are encoded in executables, so they can't be renamed.
// - TODO: Maybe use `-install-name` on macOS or `-soname` on other UNIX systems
// to specify the dylib name to be used by the linker instead of the filename.
// - Windows MSVC executables: The path to the PDB is embedded in the
// executable, and we don't want the PDB path to include the hash in it.
// - wasm32-unknown-emscripten executables: When using emscripten, the path to the
// .wasm file is embedded in the .js file, so we don't want the hash in there.


@ranger-ross ranger-ross force-pushed the multi-locking-attempt branch from 5b36e7c to 083d4b3 on December 3, 2025 14:11

@ranger-ross ranger-ross force-pushed the multi-locking-attempt branch from 083d4b3 to ec9f6bc on December 4, 2025 13:26
@rustbot
Collaborator

rustbot commented Dec 4, 2025

This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

This commit adds handling to attempt to increase the max file
descriptors if we detect that there is a risk of hitting the limit.

If we cannot increase the max file descriptors, we fall back to coarse
grain locking as that is better than a build crashing due to resource
limits.
@ranger-ross ranger-ross force-pushed the multi-locking-attempt branch from ec9f6bc to 8252855 on December 4, 2025 13:28
@ranger-ross
Member Author

@epage I re-reviewed the changes in this PR and I believe they accomplish the first step in the plan laid out in #t-cargo > Build cache and locking design @ 💬.

So I think this PR is now good to be reviewed.

@rustbot
Collaborator

rustbot commented Dec 4, 2025

☔ The latest upstream changes (possibly 1d3fb78) made this pull request unmergeable. Please resolve the merge conflicts.

Comment on lines +366 to +368
// TODO: We should probably revalidate the fingerprint here as another Cargo instance could
// have already compiled the crate before we recv'd the lock.
// For large crates re-compiling here would be quite costly.
Contributor

Let's move this TODO out of the code to a place we can track

Contributor

@epage epage Dec 4, 2025

We should probably revalidate the fingerprint here as another Cargo instance could

Or we move the lock acquisition up a level to be around the fingerprinting (since that has us read the build unit) and then we move it into the job

Member Author

hmmm good point. I overlooked locking during the fingerprint read 😓
Even if a unit is fresh, we still need to take a lock while evaluating the fingerprint.

I suppose this shouldn't be too problematic, as if the unit is fresh we'd immediately unlock it, so lock contention would be low.
I think that change should not be too difficult.

Contributor

#16155 (comment) ties into this and can also have a major effect on the design

Comment on lines +1004 to +1007
let mut lock = if build_runner.bcx.gctx.cli_unstable().fine_grain_locking
&& matches!(build_runner.locking_mode, LockingMode::Fine)
{
Some(CompilationLock::new(build_runner, unit))
Contributor

Can we have a build_runner.unit_lock(unit)?

target: Option<CompileTarget>,
dest: &str,
must_take_artifact_dir_lock: bool,
build_dir_locking_mode: &LockingMode,
Contributor

Why do we take a reference to something that could be made Copy?

Member Author

no reason, I'll make it Copy :D

Ok(())
}

pub fn downgrade(&mut self) -> CargoResult<()> {
Contributor

Can we clarify what this means? Maybe downgrade_partial?

Comment on lines 138 to 173
pub fn lock_exclusive(&mut self) -> CargoResult<()> {
assert!(self.guard.is_none());

let partial = open_file(&self.partial)?;
partial.lock()?;

let full = open_file(&self.full)?;
full.lock()?;

self.guard = Some(UnitLockGuard {
partial,
_full: Some(full),
});
Ok(())
}

pub fn lock_shared(&mut self, ty: &SharedLockType) -> CargoResult<()> {
assert!(self.guard.is_none());

let partial = open_file(&self.partial)?;
partial.lock_shared()?;

let full = if matches!(ty, SharedLockType::Full) {
let full_lock = open_file(&self.full)?;
full_lock.lock_shared()?;
Some(full_lock)
} else {
None
};

self.guard = Some(UnitLockGuard {
partial,
_full: full,
});
Ok(())
}
Contributor

Unlike Filesystem locking, this doesn't provide a way to find out what you are blocked on, or that we are even blocked.

I doubt we can send a Blocking message to the user within our current progress system, though that is something I want to eventually redesign.

Can we at least log a message saying what is being blocked?

Member Author

sure, I can add logging (see the sketch below this list).

I also looked into providing feedback, but:

  1. It was a bit tricky, as gctx is not available in the Work units that are executed by the job queue, which I believe is the primary interface to shell output.
  2. I wasn't quite sure what the best way to present this info to the user would be. I was worried about potentially flooding the screen with messages as units are unlocked and new units get blocked.
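
A minimal sketch of the logging (hypothetical helper; assumes std's try_lock/TryLockError from Rust 1.89 and the tracing crate cargo already uses for internal logging):

```rust
use std::fs::{File, TryLockError};
use std::io;
use std::path::Path;

/// Take an exclusive lock, logging only when we would actually block.
fn lock_exclusive_logged(file: &File, lock_path: &Path) -> io::Result<()> {
    match file.try_lock() {
        Ok(()) => return Ok(()),
        Err(TryLockError::WouldBlock) => {
            tracing::debug!("blocked waiting for lock on {}", lock_path.display());
        }
        Err(TryLockError::Error(e)) => return Err(e),
    }
    file.lock()?; // blocking acquire
    tracing::debug!("acquired lock on {}", lock_path.display());
    Ok(())
}
```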

/// This lock is designed to reduce file descriptors by sharing a single file descriptor for a
/// given lock when the lock is shared. The motivation for this is to avoid hitting file descriptor
/// limits when fine grain locking is enabled.
pub struct RcFileLock {
Contributor

This is a lot of complexity when we don't need to lock within our own process.

What if we tracked all locks inside of the BuildRunner? We could have a single lock per build unit that we grab exclusively as soon as we know the dep unit path, store them in a HashMap<PathBuf, FileLock> (either using Filesystem or adding a public constructor for FileLock), and hold onto them until the end of the build.

At least for a first step, it simplifies things a lot. It does mean that another build will block until this one is done if they share some build units but not all. That won't be the case for cargo check vs cargo build or for cargo check vs cargo clippy. It will be an issue for cargo check vs cargo check --no-default-features or cargo check vs cargo check --workspace. We can at least defer that out of this initial PR and evaluate the different multi-lock designs under this scheme and how much of a need there is for that.
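
Roughly, that suggestion would look like this (a sketch with hypothetical names, using std file locks rather than cargo's FileLock):

```rust
use std::collections::HashMap;
use std::fs::{File, OpenOptions};
use std::io;
use std::path::PathBuf;

/// Per-build lock table held by the BuildRunner: one exclusive lock per
/// build unit, acquired up front and dropped at the end of the build.
#[derive(Default)]
struct UnitLocks {
    held: HashMap<PathBuf, File>,
}

impl UnitLocks {
    fn acquire(&mut self, lock_path: PathBuf) -> io::Result<()> {
        if self.held.contains_key(&lock_path) {
            return Ok(()); // already held by this build
        }
        let file = OpenOptions::new().create(true).write(true).open(&lock_path)?;
        file.lock()?; // blocks until any other cargo process releases it
        self.held.insert(lock_path, file);
        Ok(())
    }
}
```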

Member Author

What if we tracked all locks inside of the BuildRunner?

We could try something like this, but the .rmeta_produce() call for pipelined builds makes this tricky, since build_runner is not in scope for the Work closure (similar issue as this comment).

Though it might be possible to plumb that over to the JobState.
I'm unsure how difficult that would be, but I can look into it.

Contributor

@epage epage Dec 5, 2025

I was suggesting we simplify things down to just one lock per build unit. We grab it when doing the fingerprint check and then hold onto it until the end. We don't need these locks for coordination within our own build; this is just for cross-process coordination. If you have two processes doing cargo check && cargo check, the second one will effectively be blocked on the first anyways. Having finer grained locks than this only helps when some leaf crates can be shared but nothing else, like what happens when different features are activated. This seems minor enough, especially when we get the cross-project build cache, which is where these are more likely to live and will have a different locking scheme.

