Implement fine grain locking for build-dir #16155

Conversation
```rust
/// Coarse grain locking (Profile level)
Coarse,
```
mtime changes in fingerprints due to uplifting/downlifting order
This will also be a concern for #5931
For non-local packages, we don't check mtimes. Unsure what we do for their build script runs.
I haven't quite thought through how we will handle mtime for the artifact cache.
Checksum freshness (#14136) would sure make this easier as mtimes are painful to deal with.
Might be worth exploring pushing that forward before starting on the artifact cache...
Note that mtimes are still used for build scripts
```rust
/// Coarse grain locking (Profile level)
Coarse,
```
tests/benches need to be run before being uplifted OR uplifted and locked during execution, which leads to more locking design being needed. (Also, running pre-uplift introduces other potential side effects, like the path displayed to the user being deleted since it's temporary.)
Do we hold any locks during test execution today? I'm not aware of any.
hmmm, I was under the assumption that the lock in Layout was held while running the tests, but I never explicitly tested that. Upon further inspection, we do indeed release the lock before executing the tests.
src/cargo/core/compiler/locking.rs (Outdated)
```rust
let primary_lock = open_file(&self.primary)?;
primary_lock.lock()?;

let secondary_lock = open_file(&self.secondary)?;
secondary_lock.lock()?;

self.guard = Some(UnitLockGuard {
    primary: primary_lock,
    _secondary: Some(secondary_lock),
});
Ok(())
```
Have we double checked if we run into problems like #15698?
I took a closer look and it appears that a similar issue could happen.
I think in practice it would not deadlock, since failing to take a lock would cause the build to fail, and the lock would then be released when the process exits.
But regardless, I went ahead and added logic to unlock the partial lock if we fail to take the full lock, just in case.
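For illustration, here is a minimal sketch of that "release the first lock if the second fails" pattern, using std's file locking (`File::lock`/`File::unlock`); the helper and the `primary`/`secondary` names mirror the snippet above and are not cargo's actual code:

```rust
use std::fs::File;
use std::io;
use std::path::Path;

// Hypothetical sketch: if the second lock cannot be taken, release the first
// one before returning, so another process is never left waiting on a
// half-acquired pair of locks.
fn lock_pair_exclusive(primary_path: &Path, secondary_path: &Path) -> io::Result<(File, File)> {
    let primary = File::options().read(true).write(true).create(true).open(primary_path)?;
    primary.lock()?; // blocks until the exclusive lock is granted

    let secondary = File::options().read(true).write(true).create(true).open(secondary_path)?;
    if let Err(e) = secondary.lock() {
        // Best-effort eager unlock; the lock would also be released when the
        // process exits, but releasing it now avoids blocking peers.
        let _ = primary.unlock();
        return Err(e);
    }

    Ok((primary, secondary))
}
```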
```rust
    pub fn downgrade(&mut self) -> CargoResult<()> {
        let guard = self
            .guard
            .as_ref()
            .context("guard was None while calling downgrade")?;

        // NOTE:
        // > Subsequent flock() calls on an already locked file will convert an existing lock to the new lock mode.
        // https://man7.org/linux/man-pages/man2/flock.2.html
        //
        // However, the `std::file::File::lock/lock_shared` is allowed to change this in the
        // future. So its probably up to us if we are okay with using this or if we want to use a
        // different interface to flock.
        guard.primary.lock_shared()?;

        Ok(())
    }
}
```
We should rely on advertised behavior, especially as I assume not all platforms are backed by flock (e.g. Windows).
We could probably move away from the std interface or perhaps open an issue to see if the t-lib would be willing to clarify the behavior of calling lock_shared() while holding an exclusive lock.
Side note: I came across this crate, which advertises the behavior we need. It's fairly small (and MIT) so we could potentially reuse part of this code for cargo's use case (or use it directly, though I don't know cargo's policy on taking dependencies on third party crates).
Any concerns with keeping the std lock for now and having an action item for this prior to stabilization?
We can move this to an Unresolved issue.
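As a point of comparison, a downgrade that relies only on documented std behavior could look roughly like this (hypothetical helper, not this PR's code); unlike the in-place flock(2) conversion it is not atomic, so another process may grab the exclusive lock in the gap and the caller has to tolerate re-checking state afterwards:

```rust
use std::fs::File;
use std::io;

// Drop the exclusive lock, then take a shared one. Both calls are documented
// std behavior, but there is a window between them where another process can
// win the exclusive lock.
fn downgrade_non_atomic(lock_file: &File) -> io::Result<()> {
    lock_file.unlock()?;
    lock_file.lock_shared()?;
    Ok(())
}
```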
Finally coming back to this PR now that we have merged #16230.
In general, something I realized we need to watch out for is the exact circumstances of why we don't always use `src/cargo/core/compiler/build_runner/compilation_files.rs` lines 873 to 882 (at 8e43074).
This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed. Rebasing is a normal part of keeping PRs up to date, so no action is needed; this note is just to help reviewers.
This commit adds handling to attempt to increase the max file descriptors if we detect that there is a risk of hitting the limit. If we cannot increase the max file descriptors, we fall back to coarse grain locking as that is better than a build crashing due to resource limits.
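Roughly what that decision could look like on unix, sketched with `libc`'s getrlimit/setrlimit; the helper name `ensure_fd_headroom` and the way `needed_fds` would be derived (e.g. from the unit graph) are assumptions for illustration, not the PR's actual code:

```rust
// Hypothetical unix-only helper: try to raise RLIMIT_NOFILE far enough to
// cover the descriptors fine grain locking would need, and report whether
// that succeeded. A caller would fall back to coarse grain locking (and warn)
// when this returns false.
#[cfg(unix)]
fn ensure_fd_headroom(needed_fds: u64) -> bool {
    unsafe {
        let mut lim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
        if libc::getrlimit(libc::RLIMIT_NOFILE, &mut lim) != 0 {
            return false; // cannot even query the limit; be conservative
        }
        if lim.rlim_cur >= needed_fds as libc::rlim_t {
            return true; // already enough headroom for fine grain locking
        }
        // Raise the soft limit as far as the hard limit allows.
        lim.rlim_cur = (needed_fds as libc::rlim_t).min(lim.rlim_max);
        libc::setrlimit(libc::RLIMIT_NOFILE, &lim) == 0
            && lim.rlim_cur >= needed_fds as libc::rlim_t
    }
}
```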
@epage I re-reviewed the changes in this PR and I believe they accomplish the first step in the plan laid out in #t-cargo > Build cache and locking design @ 💬. So I think this PR is now ready for review.
☔ The latest upstream changes (possibly 1d3fb78) made this pull request unmergeable. Please resolve the merge conflicts.
```rust
// TODO: We should probably revalidate the fingerprint here as another Cargo instance could
// have already compiled the crate before we recv'd the lock.
// For large crates re-compiling here would be quiet costly.
```
Let's move this TODO out of the code to a place where we can track it.
> We should probably revalidate the fingerprint here as another Cargo instance could
Or we move the lock acquisition up a level to be around the fingerprinting (since that has us read the build unit) and then we move it into the job
hmmm good point. I overlooked locking during the fingerprint read 😓
Even if a unit is fresh, we still need to take a lock while evaluating the fingerprint.
I suppose this shouldn't be too problematic: if the unit is fresh we'd immediately unlock it, so lock contention stays low.
I think that change should not be too difficult.
#16155 (comment) ties into this and can also have a major effect on the design.
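A rough control-flow sketch of locking around the fingerprint check, using std file locks; `lock_around_fingerprint` and the `is_fresh` closure are hypothetical stand-ins for cargo's fingerprint machinery, just showing the shape discussed above:

```rust
use std::fs::File;
use std::io;

// Hold the unit's lock while the fingerprint is evaluated, release it
// immediately when the unit is fresh, and keep it when the unit is dirty so
// the compile job can take over.
fn lock_around_fingerprint(
    unit_lock: &File,
    is_fresh: impl FnOnce() -> io::Result<bool>,
) -> io::Result<bool> {
    unit_lock.lock_shared()?; // no other process may rewrite outputs while we read them
    let fresh = is_fresh()?; // evaluate the fingerprint under the lock
    if fresh {
        unit_lock.unlock()?; // fresh units give the lock back right away, keeping contention low
    }
    // Dirty units keep the lock; the compile job takes the exclusive lock
    // before writing any artifacts.
    Ok(fresh)
}
```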
```rust
let mut lock = if build_runner.bcx.gctx.cli_unstable().fine_grain_locking
    && matches!(build_runner.locking_mode, LockingMode::Fine)
{
    Some(CompilationLock::new(build_runner, unit))
```
Can we have a `build_runner.unit_lock(unit)`?
```rust
target: Option<CompileTarget>,
dest: &str,
must_take_artifact_dir_lock: bool,
build_dir_locking_mode: &LockingMode,
```
Why do we take a reference to something that could be made Copy?
no reason, I'll make it Copy :D
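Concretely, the change amounts to something like the following; the enum variants come from this PR, while the derive list and the example function are hypothetical, just showing pass-by-value:

```rust
/// Locking granularity for the build-dir (variants as in this PR).
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum LockingMode {
    /// Coarse grain locking (Profile level)
    Coarse,
    /// Fine grain locking (build unit level)
    Fine,
}

// Hypothetical example: with Copy derived, the mode is passed by value
// instead of by reference.
fn uses_fine_grain_locking(build_dir_locking_mode: LockingMode) -> bool {
    matches!(build_dir_locking_mode, LockingMode::Fine)
}
```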
```rust
    Ok(())
}

pub fn downgrade(&mut self) -> CargoResult<()> {
```
Can we clarify what this means? Maybe `downgrade_partial`?
```rust
pub fn lock_exclusive(&mut self) -> CargoResult<()> {
    assert!(self.guard.is_none());

    let partial = open_file(&self.partial)?;
    partial.lock()?;

    let full = open_file(&self.full)?;
    full.lock()?;

    self.guard = Some(UnitLockGuard {
        partial,
        _full: Some(full),
    });
    Ok(())
}

pub fn lock_shared(&mut self, ty: &SharedLockType) -> CargoResult<()> {
    assert!(self.guard.is_none());

    let partial = open_file(&self.partial)?;
    partial.lock_shared()?;

    let full = if matches!(ty, SharedLockType::Full) {
        let full_lock = open_file(&self.full)?;
        full_lock.lock_shared()?;
        Some(full_lock)
    } else {
        None
    };

    self.guard = Some(UnitLockGuard {
        partial,
        _full: full,
    });
    Ok(())
}
```
Unlike `Filesystem`'s locking, this doesn't provide a way to find out what you are blocked on, or even that we are blocked.
I doubt we can send a Blocking message to the user within our current progress system, though that is something I want to eventually redesign.
Can we at least log a message saying what is being blocked?
sure, I can add logging (a rough sketch of what that could look like is below).
I also looked into providing feedback, but:

- It was a bit tricky, as `gctx` is not available in the `Work` units that are executed by the job queue, which I believe is the primary interface to shell output.
- I wasn't quite sure of the best way to present this info to the user. I was worried about potentially flooding the screen with messages as units are unlocked and new units get blocked.
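A minimal sketch of that logging, assuming std's `try_lock`/`lock` and the `tracing` crate for debug output; the helper name and message text are illustrative:

```rust
use std::fs::File;
use std::io;

// Try without blocking first; if that does not immediately succeed, say what
// we are about to wait on, then fall back to the blocking call.
fn lock_exclusive_with_log(file: &File, what: &str) -> io::Result<()> {
    if file.try_lock().is_err() {
        tracing::debug!("waiting for an exclusive lock on {what}");
        file.lock()?;
    }
    Ok(())
}
```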
```rust
/// This lock is designed to reduce file descriptors by sharing a single file descriptor for a
/// given lock when the lock is shared. The motivation for this is to avoid hitting file descriptor
/// limits when fine grain locking is enabled.
pub struct RcFileLock {
```
This is a lot of complexity when we don't need to lock within our own process.
What if we tracked all locks inside of the `BuildRunner`? We could have a single lock per build unit that we grab exclusively as soon as we know the dep unit path, store them in a `HashMap<PathBuf, FileLock>` (either using `Filesystem` or adding a public constructor for `FileLock`), and hold onto them until the end of the build.
At least for a first step, it simplifies things a lot. It does mean that another build will block until this one is done if they share some build units but not all. That won't be the case for `cargo check` vs `cargo build` or for `cargo check` vs `cargo clippy`. It will be an issue for `cargo check` vs `cargo check --no-default-features` or `cargo check` vs `cargo check --workspace`. We can at least defer that out of this initial PR and then evaluate different multi-lock designs under this scheme and how much of a need there is for them.
> What if we tracked all locks inside of the `BuildRunner`?

We could try something like this, but the `.rmeta_produce()` call for pipelined builds makes this tricky since `build_runner` is not in scope for the `Work` closure (similar issue as this comment).
Though maybe it might be possible to plumb that over to the `JobState`.
Unsure how difficult that would be, but I can look into it.
I was suggesting we simplify things down to just one lock per build unit. We grab it when doing the fingerprint check and then hold onto it until the end. We don't need these locks for coordination within our own build; this is just for cross-process coordination. If you have two processes doing `cargo check && cargo check`, the second one will effectively be blocked on the first anyway. Having finer grained locks than this only helps when some leaf crates can be shared but nothing else, like what happens when different features are activated. This seems minor enough, especially when we get the cross-project build cache, which is where these are more likely to live and will have a different locking scheme.
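A rough sketch of that simplification, assuming one lock file per build unit directory held in a map for the lifetime of the build; the type, the lock file name, and the method here are hypothetical rather than cargo's `BuildRunner` API:

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io;
use std::path::{Path, PathBuf};

// One exclusive lock per build unit, keyed by path, acquired at fingerprint
// time and released only when this map (and thus the build) is dropped.
#[derive(Default)]
struct UnitLocks {
    held: HashMap<PathBuf, File>,
}

impl UnitLocks {
    fn acquire(&mut self, unit_dir: &Path) -> io::Result<()> {
        if self.held.contains_key(unit_dir) {
            return Ok(()); // already locked by this build
        }
        let lock_path = unit_dir.join(".cargo-unit-lock"); // hypothetical file name
        let file = File::options().read(true).write(true).create(true).open(&lock_path)?;
        file.lock()?; // exclusive; other cargo processes wait here
        self.held.insert(unit_dir.to_path_buf(), file);
        Ok(())
    }
}
```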
This PR adds fine grain locking for the build cache using build unit level locking.
I'd recommend reading the design details in this description and then reviewing commit by commit.
Part of #4282
Previous attempt: #16089
Design decisions / rationale

Implementation details

- … `primary.lock` and `secondary.lock` (see the `locking.rs` module docs for more details on the states)
- … (`determine_locking_mode()`)
- … the `.cargo-lock` lock as RO shared, to continue working with older cargo versions while allowing multiple newer cargo instances to run in parallel
- `cargo clean` continues to use coarse grain locking for simplicity
- … `UnitGraph`
- … (number of build units * 10), we automatically fall back to coarse grain locking and display a warning to the user
- … `RcFileLock` … I was seeing peaks of ~12,000 open fds, which I felt was quite high even for a large project like Zed
- … a `FileLockInterner` that holds on to the file descriptors (`RcFileLock`) until the end of the process (we could potentially add it to `JobState` if preferred, it would just be a bit more plumbing); a rough sketch of this fd-sharing idea follows below
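A rough sketch of the fd-sharing idea behind `RcFileLock`/`FileLockInterner`: shared locks on the same path reuse a single `File` via `Rc`, so many shared holders cost one file descriptor. The names below are illustrative, and exclusive locks and lock release are omitted:

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io;
use std::path::{Path, PathBuf};
use std::rc::Rc;

#[derive(Default)]
struct LockInterner {
    shared: HashMap<PathBuf, Rc<File>>,
}

impl LockInterner {
    /// Return a shared-lock handle for `path`, reusing the fd if one is already held.
    fn shared(&mut self, path: &Path) -> io::Result<Rc<File>> {
        if let Some(existing) = self.shared.get(path) {
            return Ok(Rc::clone(existing));
        }
        let file = File::options().read(true).write(true).create(true).open(path)?;
        file.lock_shared()?; // one flock per path, however many in-process holders
        let rc = Rc::new(file);
        self.shared.insert(path.to_path_buf(), Rc::clone(&rc));
        Ok(rc)
    }
}
```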
Open Questions

- … `build-dir` (#16155 (comment))
- … `build-dir` (#16155 (comment))
- … (`cargo doc` … `cargo check` would not take a lock on the artifact-dir but `cargo build` would). This would mean that 2 `cargo build` invocations would not run in parallel because one of them would hold the artifact-dir lock (blocking the other). This might actually be ideal, to avoid 2 instances fighting over the CPU while recompiling the same crates.
- … `-Zbuild-dir-new-layout`. With the current implementation I am not seeing any perf regression on linux, but I have yet to test on windows/macos.
- … `CARGO_BUILD_LOCKING_MODE=coarse`?