Commit 51d91f6

feature/cancel-command (#593)
* add multi-run monitoring
* update and add tests for monitor
* update flowchart in docs for the monitor
* update CHANGELOG
* run fix-style
* fix tests for py 3.8
* fix style
* fix yet another issue with py3.8
* run fix-style
* move db specific utils to their own utilities file
* add status for runs and remove run_complete attribute
* add merlin cancel command
* update CHANGELOG
* update docs
* run fix style
* fix broken tests
* add line about disabling steps in the cancel process
* add tests for the new files
* fix style
* change entity statuses to run_status and worker_status respectively
* fix style
* fix broken integration tests after latest changes
1 parent f2b7463 commit 51d91f6

32 files changed, +1946 -334 lines changed


CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -7,11 +7,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 
 ### Added
+- New `merlin cancel <yaml>` command that cancels all runs of a current study
 - Monitor now checks each run of a study on every loop
 - Garbage collection functionality for the database. Called with: `merlin database gc`, `merlin database garbage-collect`, or `merlin database cleanup`
 - Built-in database garbage collection to the `merlin monitor`
   - Can be disabled with `--disable-gc`
 - Alias for `merlin database` command so it can be called with `merlin db`
+- Status of run entities in the database (this will differ from task statuses)
 
 ## [2.0.0b2]

docs/user_guide/command_line.md

Lines changed: 63 additions & 0 deletions
@@ -448,6 +448,7 @@ merlin server config [OPTIONS]
 
 The Merlin library provides several commands for setting up and managing your Merlin workflow:
 
+- *[cancel](#cancel-merlin-cancel)*: Cancel a study
 - *[database](#database-merlin-database)*: Interact with Merlin's backend database
 - *[example](#example-merlin-example)*: Download pre-made workflow specifications that can be modified for your own workflow needs
 - *[purge](#purge-merlin-purge)*: Clear any tasks that are currently living in the central server
@@ -456,6 +457,68 @@ The Merlin library provides several commands for setting up and managing your Me
 - *[run workers](#run-workers-merlin-run-workers)*: Start up workers that will execute the tasks that exist on the central server
 - *[stop workers](#stop-workers-merlin-stop-workers)*: Stop existing workers
 
+### Cancel (`merlin cancel`)
+
+The `merlin cancel` command allows you to cancel a running study. A study may have multiple runs executing concurrently on multiple workers, all watched by a monitor process that ensures every run finishes. This command helps ensure that every piece of this process is cancelled gracefully.
+
+In other words, the `merlin cancel` command acts as a wrapper around the [`merlin purge`](#purge-merlin-purge) and [`merlin stop-workers`](#stop-workers-merlin-stop-workers) commands. Additionally, any runs associated with the study being cancelled will be marked as cancelled in the database. This ensures that the `merlin monitor` command will no longer track these runs.
+
+In short, this command will:
+
+1. Purge the queues from the study (`merlin purge -f SPECIFICATION`)
+2. Stop the workers for that study (`merlin stop-workers --spec SPECIFICATION`)
+3. Mark all runs associated with that study as cancelled
+
+Each of these steps can be disabled with its respective option in the table below.
+
+**Usage:**
+
+```bash
+merlin cancel [OPTIONS] SPECIFICATION
+```
+
+**Options:**
+
+| Name | Type | Description | Default |
+| ------------------------- | ------- | ----------- | ------- |
+| `-h`, `--help` | boolean | Show this help message and exit | `False` |
+| `--no-purge` | boolean | Skip purging the queues for the study (skip step 1 above). | `False` |
+| `--no-stop-workers` | boolean | Skip stopping the workers for the study (skip step 2 above). | `False` |
+| `--no-mark-cancelled` | boolean | Skip marking runs as cancelled in the database (skip step 3 above). | `False` |
+| `--vars` | List[string] | A space-delimited list of variables to override in the spec file. This list should be given after the spec file is provided. Ex: `--vars LEARN=/path/to/new_learn.py EPOCHS=3` | `None` |
+
+**Examples:**
+
+!!! example "Basic Cancel Example"
+
+    ```bash
+    merlin cancel my_specification.yaml
+    ```
+
+!!! example "Cancel Without Purging Queues Example"
+
+    ```bash
+    merlin cancel my_specification.yaml --no-purge
+    ```
+
+!!! example "Cancel Without Stopping Workers Example"
+
+    ```bash
+    merlin cancel my_specification.yaml --no-stop-workers
+    ```
+
+!!! example "Cancel Without Marking Runs as Cancelled Example"
+
+    ```bash
+    merlin cancel my_specification.yaml --no-mark-cancelled
+    ```
+
+!!! example "Cancel and Substitute Variables Example"
+
+    ```bash
+    merlin cancel my_specification.yaml --vars CUSTOM_QUEUE=new_queue
+    ```
+
 ### Database (`merlin database`)
 
 !!! note "Alias"
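
Reviewer sketch: the three documented steps of `merlin cancel`, and the way each `--no-*` flag removes one of them, could be modelled roughly as below. This is only an illustration under stated assumptions: `plan_cancel_steps` is a hypothetical helper that does not exist in Merlin, and only the two quoted command strings are taken from the documentation in this hunk.

```python
from typing import List


def plan_cancel_steps(
    spec: str,
    no_purge: bool = False,
    no_stop_workers: bool = False,
    no_mark_cancelled: bool = False,
) -> List[str]:
    """Return the documented cancel steps, in order, that would still run.

    Hypothetical helper for illustration only; it is not part of Merlin.
    """
    steps = []
    if not no_purge:
        # Step 1 in the docs: purge the queues for the study
        steps.append(f"merlin purge -f {spec}")
    if not no_stop_workers:
        # Step 2 in the docs: stop the workers for the study
        steps.append(f"merlin stop-workers --spec {spec}")
    if not no_mark_cancelled:
        # Step 3 has no standalone CLI equivalent in the docs, so it is
        # described in words here
        steps.append("mark all runs of the study as cancelled in the database")
    return steps
```

For example, `plan_cancel_steps("my_specification.yaml", no_purge=True)` would leave only the stop-workers and mark-cancelled steps, mirroring the "Cancel Without Purging Queues" example.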

docs/user_guide/database/entities.md

Lines changed: 17 additions & 2 deletions
@@ -36,7 +36,7 @@ The `RunEntity` represents a single execution of a study. It captures the config
 | `workers` | `List[uuid4]` | List of [`LogicalWorker`](#logical-worker-entity) IDs serving tasks for this run. |
 | `parent` | `uuid4 \| NULL` | ID of parent run (if this run was started by another run). |
 | `child` | `uuid4 \| NULL` | ID of child run (if this run spawned a new run). |
-| `run_complete` | `bool` | Indicates whether the run has finished. |
+| `run_status` | `str` | The status of a run. |
 | `parameters` | `Dict` | Arbitrary key/value parameters provided to the run. |
 | `samples` | `Dict` | Arbitrary samples provided to the run. |
 
@@ -46,6 +46,21 @@ The `RunEntity` represents a single execution of a study. It captures the config
 - Many-to-many with [`LogicalWorkerEntity`](#logical-worker-entity): Multiple runs can be linked to multiple logical workers.
 - Optional one-to-one with parent/child `RunEntity`: A single run can link to another run.
 
+### Run Status
+
+The `run_status` entry is *not* what [the status commands](../monitoring/status_cmds.md) track. Those commands track step- and task-level statuses. This entry tracks run-level status, which becomes important for the [`merlin monitor`](../command_line.md#monitor-merlin-monitor) command.
+
+Below is a table of possible statuses for a run.
+
+| Status | Description |
+| ------------- | --------------------------------------------------------- |
+| `INITIALIZED` | Run has been created in the database but not queued. |
+| `QUEUED` | Run is queued on the task server and waiting to start. |
+| `RUNNING` | Run is currently executing. |
+| `COMPLETED` | Run has finished successfully. |
+| `CANCELLED` | Run was cancelled by the user. |
+| `FAILED` | Run hard failed due to an error. |
+
 ## Worker Entities
 
 Merlin supports two distinct worker models: logical and physical. Logical workers define high-level behavior and configuration. Physical workers represent actual runtime processes launched from logical definitions. The below sections will go into further detail on both entities.
@@ -79,7 +94,7 @@ The `PhysicalWorkerEntity` represents an actual running instance of a worker pro
 | `launch_cmd` | `str` | Exact CLI used to start the worker. |
 | `args` | `Dict` | Additional runtime args or config passed to the worker process. |
 | `pid` | `str` | OS process ID in string format. |
-| `status` | `WorkerStatus` | Current status (e.g., `RUNNING`, `STOPPED`). |
+| `worker_status` | `WorkerStatus` | Current status of the worker (e.g., `RUNNING`, `STOPPED`). |
 | `heartbeat_timestamp` | `datetime` | Last time the worker checked in. |
 | `latest_start_time` | `datetime` | When this process was most recently (re)launched. |
 | `host` | `str` | Hostname or IP where this process is running. |
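
Reviewer sketch: the six run statuses listed in the Run Status table of `entities.md` could be represented as a string-valued enum, which keeps the documented `str` type of `run_status` while rejecting unknown values. The names `RunStatus`, `FINISHED_STATUSES`, and `is_finished` are assumptions made for this illustration; the commit excerpt does not show how Merlin defines them. The set of finished statuses is taken from the monitor documentation changed in this commit.

```python
from enum import Enum


class RunStatus(str, Enum):
    """Hypothetical enum mirroring the statuses in the Run Status table."""

    INITIALIZED = "INITIALIZED"
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    CANCELLED = "CANCELLED"
    FAILED = "FAILED"


# Statuses the monitor documentation describes as "finished"
FINISHED_STATUSES = frozenset({RunStatus.COMPLETED, RunStatus.CANCELLED, RunStatus.FAILED})


def is_finished(status: str) -> bool:
    """Return True if the stored status string is a finished status.

    Raises ValueError if the string is not one of the documented statuses.
    """
    return RunStatus(status) in FINISHED_STATUSES
```

Because the enum subclasses `str`, a member compares equal to its plain string value, so a value read back from the database as text can be checked without extra conversion.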

docs/user_guide/monitoring/monitor_for_allocation.md

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ For each run of a study, the monitor ensures completion by performing the follow
 
 The monitor includes a [`--sleep` option](#sleep), which introduces a deliberate delay. Before starting, the monitor waits for the specified `--sleep` duration, giving users time to populate the task queues for their run using the [`merlin run`](../command_line.md#run-merlin-run) command. Additionally, the monitor pauses for the `--sleep` duration between each check of the run. Finally, it will wait up to 10 times the specified `--sleep` duration for workers to spin up for the run.
 
-A run is considered complete when the monitor reads the run's `run_complete` entry from the database and it returns `True`. This entry is always set as the final task of a run.
+A run is considered complete when the monitor reads the [run's `run_status` entry](../database/entities.md#run-status) from the database and it holds a finished status (`COMPLETED`, `CANCELLED`, or `FAILED`). This entry is always set as the final task of a run.
 
 The resulting flowchart of this process can be seen below.
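
Reviewer sketch: the completion check described in this hunk amounts to polling a run's status until it reaches one of the three finished values, sleeping between checks. The loop below is a simplified stand-in, not the monitor's actual code: `wait_for_run`, `read_status`, `max_checks`, and `sleep_fn` are hypothetical names, and the real monitor also handles worker startup and restarts, which are omitted here.

```python
import time
from typing import Callable, Optional

# Finished statuses as listed in the monitor documentation
FINISHED_STATUSES = {"COMPLETED", "CANCELLED", "FAILED"}


def wait_for_run(
    read_status: Callable[[], str],
    sleep_seconds: float,
    max_checks: int,
    sleep_fn: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll a run's status until it is finished or the check budget runs out.

    `read_status` stands in for a database read of the run's status entry.
    Returns the finished status, or None if the run never finished.
    """
    for _ in range(max_checks):
        status = read_status()
        if status in FINISHED_STATUSES:
            return status
        # Pause between checks, as the --sleep option does for the monitor
        sleep_fn(sleep_seconds)
    return None
```

Passing `sleep_fn` as a parameter keeps the loop testable without real delays. A run marked `CANCELLED` by `merlin cancel` would end this loop the same way a `COMPLETED` run does, which is why the monitor stops tracking it.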

merlin/celery.py

Lines changed: 1 addition & 1 deletion
@@ -241,7 +241,7 @@ def handle_worker_shutdown(sender: str = None, **kwargs):
     merlin_db = MerlinDatabase()
     physical_worker = merlin_db.get("physical_worker", str(sender))
     if physical_worker:
-        physical_worker.set_status(WorkerStatus.STOPPED)
+        physical_worker.set_worker_status(WorkerStatus.STOPPED)
         physical_worker.set_pid(None)  # Clear the pid
     else:
         LOG.warning(f"Worker {sender} not found in the database.")

merlin/cli/commands/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -33,6 +33,7 @@
 for interacting with the Merlin database.
 """
 
+from merlin.cli.commands.cancel import CancelCommand
 from merlin.cli.commands.config import ConfigCommand
 from merlin.cli.commands.database import DatabaseCommand
 from merlin.cli.commands.example import ExampleCommand
@@ -51,6 +52,7 @@
 
 # Keep these in alphabetical order
 ALL_COMMANDS = [
+    CancelCommand(),
     ConfigCommand(),
     DatabaseCommand(),
     DetailedStatusCommand(),

merlin/cli/commands/cancel.py

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
+##############################################################################
+# Copyright (c) Lawrence Livermore National Security, LLC and other Merlin
+# Project developers. See top-level LICENSE and COPYRIGHT files for dates and
+# other details. No copyright assignment is required to contribute to Merlin.
+##############################################################################
+
+"""
+CLI module for cancelling running studies.
+
+This module defines the `CancelCommand` class, which handles the `cancel` subcommand
+of the Merlin CLI. The `cancel` command is intended to terminate running studies
+gracefully. It acts as a wrapper around the `merlin purge` and `merlin stop-workers`
+commands. Additionally, any running studies associated with the study being cancelled
+will be marked as cancelled in the database. This ensures that the `merlin monitor`
+command will no longer track these studies.
+
+In short, this command will:
+
+1. Purge the queues from the study
+2. Stop the workers for that study
+3. Mark all runs associated with that study as cancelled
+"""
+
+# pylint: disable=duplicate-code
+
+import logging
+from argparse import ArgumentParser, Namespace
+
+from merlin.cli.commands.command_entry_point import CommandEntryPoint
+from merlin.cli.utils import get_merlin_spec_with_override
+from merlin.study.manager import StudyManager
+
+
+LOG = logging.getLogger("merlin")
+
+
+class CancelCommand(CommandEntryPoint):
+    """
+    Handles `cancel` CLI command for terminating running studies gracefully.
+
+    Methods:
+        add_parser: Adds the `cancel` command to the CLI parser.
+        process_command: Processes the CLI input and dispatches the appropriate action.
+    """
+
+    def add_parser(self, subparsers: ArgumentParser):
+        """
+        Add the `cancel` command parser to the CLI argument parser.
+
+        Parameters:
+            subparsers (ArgumentParser): The subparsers object to which the `cancel` command parser will be added.
+        """
+        cancel: ArgumentParser = subparsers.add_parser(
+            "cancel",
+            help="Terminate running studies gracefully by purging queues, stopping workers, and marking runs as cancelled.",
+        )
+        cancel.set_defaults(func=self.process_command)
+
+        cancel.add_argument(
+            "specification",
+            type=str,
+            help="Path to a Merlin YAML spec file for the study you want to cancel.",
+        )
+
+        # Options to skip certain steps in the cancellation process
+        cancel.add_argument(
+            "--no-purge",
+            action="store_true",
+            help="Skip purging the queues for the study.",
+        )
+        cancel.add_argument(
+            "--no-stop-workers",
+            action="store_true",
+            help="Skip stopping the workers for the study.",
+        )
+        cancel.add_argument(
+            "--no-mark-cancelled",
+            action="store_true",
+            help="Skip marking runs as cancelled in the database.",
+        )
+
+        # Option to substitute variables in the specification file
+        cancel.add_argument(
+            "--vars",
+            action="store",
+            dest="variables",
+            type=str,
+            nargs="+",
+            default=None,
+            help="Specify desired Merlin variable values to override those found in the specification. Space-delimited. "
+            "Example: '--vars LEARN=path/to/new_learn.py EPOCHS=3'",
+        )
+
+    def process_command(self, args: Namespace):
+        """
+        CLI command to cancel a running study.
+
+        Args:
+            args: Parsed CLI arguments.
+        """
+        spec, _ = get_merlin_spec_with_override(args)
+
+        study_manager = StudyManager()
+        result = study_manager.cancel(
+            spec=spec,
+            purge_queues=not args.no_purge,
+            stop_workers=not args.no_stop_workers,
+            mark_runs_cancelled=not args.no_mark_cancelled,
+        )
+
+        # Print summary
+        result_summary = (
+            "\nCancellation Summary:\n"
+            f" Study: {result['study_name']}\n"
+            f" Runs cancelled: {result['runs_cancelled']}\n"
+            f" Queues purged: {len(result['queues_purged'])}\n"
+            f" Workers stopped: {len(result['workers_stopped'])}"
+        )
+        LOG.info(result_summary)
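
Reviewer sketch: `process_command` inverts each opt-out flag (`--no-purge`, `--no-stop-workers`, `--no-mark-cancelled`) into a positive keyword argument for `StudyManager.cancel`. The standalone snippet below reproduces only that flag handling with plain `argparse`, so the inversion can be checked without Merlin installed. `build_parser` and `to_cancel_kwargs` are hypothetical names for this illustration, and the `--vars` option is left out for brevity.

```python
from argparse import ArgumentParser
from typing import Dict, List


def build_parser() -> ArgumentParser:
    """Build a parser with the same positional and opt-out flags as `merlin cancel`."""
    parser = ArgumentParser(prog="cancel-sketch")
    parser.add_argument("specification", type=str)
    # store_true flags default to False, so every step runs unless skipped
    parser.add_argument("--no-purge", action="store_true")
    parser.add_argument("--no-stop-workers", action="store_true")
    parser.add_argument("--no-mark-cancelled", action="store_true")
    return parser


def to_cancel_kwargs(argv: List[str]) -> Dict[str, bool]:
    """Translate opt-out CLI flags into the positive keyword names used in the diff."""
    args = build_parser().parse_args(argv)
    return {
        "purge_queues": not args.no_purge,
        "stop_workers": not args.no_stop_workers,
        "mark_runs_cancelled": not args.no_mark_cancelled,
    }
```

With no flags, all three values are `True`. Opt-out flags keep the default invocation short while the receiving function still gets positively named booleans, which tend to be easier to read at the call site.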

merlin/cli/commands/database/delete.py

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@
 
 from merlin.cli.commands.command_entry_point import CommandEntryPoint
 from merlin.cli.commands.database.entity_registry import ENTITY_REGISTRY
-from merlin.cli.utils import get_filters_for_entity, setup_db_entity_subcommands
+from merlin.cli.commands.database.utils import get_filters_for_entity, setup_db_entity_subcommands
 from merlin.config.configfile import initialize_config
 from merlin.db_scripts.merlin_db import MerlinDatabase
 from merlin.utils import get_singular_of_entity

merlin/cli/commands/database/entity_registry.py

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@
     "run": {
         "filters": [
             {"name": "study_id", "type": str},
-            {"name": "run_complete", "type": bool},
+            {"name": "status", "type": str},
             {"name": "queues", "type": str, "nargs": "+"},
             {"name": "workers", "type": str, "nargs": "+"},
         ],
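
Reviewer sketch: each filter entry in the registry carries a `name` plus keyword arguments that look like `argparse` options (`type`, `nargs`). One plausible way to consume such entries is shown below. This is an assumption about how the registry might be used, not Merlin's code: `add_filter_options` and `collect_filters` are hypothetical helpers, the `--<name>` flag spelling is invented, and the real `setup_db_entity_subcommands` and `get_filters_for_entity` functions are not shown in this commit excerpt.

```python
from argparse import ArgumentParser, Namespace
from typing import Any, Dict, List

# Filter entries copied from the "run" section of the diff above
RUN_FILTERS: List[Dict[str, Any]] = [
    {"name": "study_id", "type": str},
    {"name": "status", "type": str},
    {"name": "queues", "type": str, "nargs": "+"},
    {"name": "workers", "type": str, "nargs": "+"},
]


def add_filter_options(parser: ArgumentParser, filters: List[Dict[str, Any]]) -> None:
    """Register one optional `--<name>` argument per filter entry (hypothetical)."""
    for entry in filters:
        # Everything except "name" is passed straight through to argparse
        options = {key: value for key, value in entry.items() if key != "name"}
        parser.add_argument(f"--{entry['name']}", **options)


def collect_filters(args: Namespace, filters: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Return only the filters the user actually supplied."""
    supplied = {}
    for entry in filters:
        value = getattr(args, entry["name"])
        if value is not None:
            supplied[entry["name"]] = value
    return supplied
```

Under this reading, replacing the boolean `run_complete` filter with a string `status` filter lets a query select runs in any single state, such as `CANCELLED`, rather than only finished versus unfinished.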

merlin/cli/commands/database/get.py

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@
 
 from merlin.cli.commands.command_entry_point import CommandEntryPoint
 from merlin.cli.commands.database.entity_registry import ENTITY_REGISTRY
-from merlin.cli.utils import get_filters_for_entity, setup_db_entity_subcommands
+from merlin.cli.commands.database.utils import get_filters_for_entity, setup_db_entity_subcommands
 from merlin.config.configfile import initialize_config
 from merlin.db_scripts.merlin_db import MerlinDatabase
 from merlin.utils import get_plural_of_entity, get_singular_of_entity
