ibm_spectrum_lsf/README.md
119 additions & 8 deletions
@@ -4,10 +4,7 @@
4
4
5
5
This check monitors [IBM Spectrum LSF][1] using the Datadog Agent.
6
6
7
-
Include a high level overview of what this integration does:
8
-
- What does your product do (in 1-2 sentences)?
9
-
- What value will customers get from this integration, and why is it valuable to them?
10
-
- What specific data will your integration monitor, and what's the value of that data?
7
+
This integration gives an overview of the performance of your IBM Spectrum LSF environment. It also provides detailed information about running and completed jobs, slot utilization, and queues.
11
8
12
9
## Setup
13
10
@@ -16,18 +13,126 @@ Follow the instructions below to install and configure this check for an Agent r
16
13
### Installation
17
14
18
15
The IBM Spectrum LSF check is included in the [Datadog Agent][2] package.
19
-
No additional installation is needed on your server.
16
+
17
+
Install the Datadog Agent and configure the IBM Spectrum LSF check on the management host of your cluster. This integration will monitor the entire cluster.
18
+
19
+
#### On Linux
20
+
21
+
Add the `dd-agent` user as an LSF [administrator][10].
22
+
23
+
The integration runs commands such as `lsid`, `bhosts`, and `lsclusters`. To run these commands, the Agent needs them in its `PATH`. In an interactive shell this is typically done by running `source $LSF_HOME/conf/profile.lsf`. However, the Datadog Agent runs as a service orchestrated by upstart or systemd, so the environment variables may need to be added to the service configuration files at the default locations of:
Each time the Agent is updated, `/etc/init/datadog-agent.conf` is wiped and the environment variables need to be added again.
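
To confirm that the LSF commands can be resolved from a given environment, a small script can help. The following is a minimal sketch, not part of the integration: `missing_commands` is a hypothetical helper, and the command list contains only the commands named above, so the integration may need others as well.

```python
import shutil

# Commands named in this README; the integration may run others as well.
LSF_COMMANDS = ("lsid", "bhosts", "lsclusters")


def missing_commands(commands=LSF_COMMANDS, path=None):
    """Return the commands that cannot be resolved on a PATH.

    `path` is a PATH-style string; None means the current environment's PATH.
    """
    return [cmd for cmd in commands if shutil.which(cmd, path=path) is None]


if __name__ == "__main__":
    missing = missing_commands()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed LSF commands were found.")
```

The result is only meaningful when the script runs with the same environment as the Agent service, for example as the `dd-agent` user with the variables from the service configuration applied.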
20
86
21
87
### Configuration
22
88
23
89
1. Edit the `ibm_spectrum_lsf.d/conf.yaml` file, in the `conf.d/` folder at the root of your Agent's configuration directory, to start collecting your `ibm_spectrum_lsf` performance data. See the [sample ibm_spectrum_lsf.d/conf.yaml][4] for all available configuration options.
24
90
91
+
The IBM Spectrum LSF integration runs a series of management commands to collect data. To control which commands are run and which metrics are emitted, use the `metric_sources` configuration option. By default, data is collected from the following commands: `lsclusters`, `lshosts`, `bhosts`, `lsload`, `bqueues`, `bslots`, and `bjobs`. You can enable additional optional metric sources or opt out of collecting any set of metrics.
92
+
93
+
For example, to collect only GPU-specific metrics, set `metric_sources` as follows:
94
+
```
metric_sources:
  - lsload_gpu
  - bhosts_gpu
```
99
+
100
+
The `badmin_perfmon` metric source collects data from the `badmin perfmon view -json` command. This collects [overall statistics][12] about the cluster. To collect these metrics, performance collection must be enabled on your server using the `badmin perfmon start <COLLECTION_INTERVAL>` command. By default, the integration runs this command automatically (and stops collection once the Agent is turned off). However, you can turn off this behavior by setting `badmin_perfmon_auto: false`.
101
+
102
+
Since collecting these metrics can add extra load on your server, we recommend setting a higher collection interval for these metrics, at least 60 seconds. The exact value depends on the load and size of your cluster. View IBM Spectrum LSF's [recommendations][13] for managing high query load.
103
+
104
+
Similarly, the `bhist` command collects information about completed jobs. Because this can be query intensive, we recommend monitoring this command with `min_collection_interval` set to 60.
105
+
106
+
Here is a sample configuration monitoring all available metrics:
107
+
108
+
```
instances:
  - cluster_name: test-cluster
    metric_sources:
      - lsclusters
      - lshosts
      - bhosts
      - lsload
      - bqueues
      - bslots
      - bjobs
      - lsload_gpu
      - bhosts_gpu
  - cluster_name: test-cluster
    badmin_perfmon_auto: false
    metric_sources:
      - badmin_perfmon
      - bhist
    min_collection_interval: 60
```
128
+
25
129
2. [Restart the Agent][5].
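
The sample configuration in step 1 puts the query-intensive sources (`badmin_perfmon` and `bhist`) in a separate instance with a longer `min_collection_interval`. If you generate configurations for several clusters, that split can be sketched as below. This is an illustration only: `split_instances` and `EXPENSIVE_SOURCES` are hypothetical names, not part of the integration, and the set of expensive sources is an assumption taken from the text above.

```python
# Sources this README describes as adding load (an assumption drawn from the text).
EXPENSIVE_SOURCES = {"badmin_perfmon", "bhist"}


def split_instances(cluster_name, sources, slow_interval=60):
    """Build a config dict with expensive sources in their own instance."""
    fast = [s for s in sources if s not in EXPENSIVE_SOURCES]
    slow = [s for s in sources if s in EXPENSIVE_SOURCES]

    instances = []
    if fast:
        instances.append({"cluster_name": cluster_name, "metric_sources": fast})
    if slow:
        instances.append({
            "cluster_name": cluster_name,
            "metric_sources": slow,
            # Collect the expensive sources less often than the default.
            "min_collection_interval": slow_interval,
        })
    return {"instances": instances}


config = split_instances("test-cluster", ["bhosts", "bjobs", "bhist"])
```

The resulting dictionary mirrors the structure of `conf.yaml` and can be serialized with the YAML library of your choice. Options such as `badmin_perfmon_auto` are left out here and would need to be added per instance.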
26
130
27
131
### Validation
28
132
29
133
[Run the Agent's status subcommand][6] and look for `ibm_spectrum_lsf` under the Checks section.
30
134
135
+
31
136
## Data Collected
32
137
33
138
### Metrics
@@ -42,14 +147,16 @@ The IBM Spectrum LSF integration does not include any events.
42
147
43
148
The IBM Spectrum LSF integration does not include any service checks.
44
149
45
-
See [service_checks.json][8] for a list of service checks provided by this integration.
46
-
47
150
## Troubleshooting
48
151
152
+
Use the `datadog-agent check` command to view the metrics the integration is collecting, as well as debug logs from the check: