Commit d518118

Add documentation
1 parent 8c6c226 commit d518118

2 files changed: +264 -153 lines changed

ibm_spectrum_lsf/README.md

Lines changed: 119 additions & 8 deletions
@@ -4,10 +4,7 @@
44

55
This check monitors [IBM Spectrum LSF][1] using the Datadog Agent.
66

7-
Include a high level overview of what this integration does:
8-
- What does your product do (in 1-2 sentences)?
9-
- What value will customers get from this integration, and why is it valuable to them?
10-
- What specific data will your integration monitor, and what's the value of that data?
7+
This integration gives an overview of the performance of your IBM Spectrum LSF environment. It also provides detailed information about running and completed jobs, slot utilization, and queues.
118

129
## Setup
1310

@@ -16,18 +13,126 @@ Follow the instructions below to install and configure this check for an Agent r
1613
### Installation
1714

1815
The IBM Spectrum LSF check is included in the [Datadog Agent][2] package.
19-
No additional installation is needed on your server.
16+
17+
Install the Datadog Agent and configure the IBM Spectrum LSF check on the management host of your cluster. This integration will monitor the entire cluster.
18+
19+
#### On Linux
20+
21+
Add the `dd-agent` user as an LSF [administrator][10].
22+
23+
The integration runs commands such as `lsid`, `bhosts`, and `lsclusters`. To run these commands, the Agent needs them in its `PATH`. In an interactive shell this is typically done by running `source $LSF_HOME/conf/profile.lsf`. However, the Datadog Agent runs as a service managed by upstart or systemd, so the environment variables may need to be added to the service configuration files. Their default locations are:
24+
- Upstart: `/etc/init/datadog-agent.conf`
25+
- Systemd: `/lib/systemd/system/datadog-agent.service`
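
Before editing the service configuration, it can help to confirm which LSF commands are resolvable in a given environment. The following is a minimal sketch, not part of LSF or the Datadog Agent: `missing_commands` is a hypothetical helper name, and it only checks whether each name can be found through the current `PATH`.

```shell
# Hypothetical helper: print each given command name that cannot be
# resolved through the current PATH. Prints nothing if all are found.
missing_commands() {
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      printf '%s\n' "$cmd"
    fi
  done
}
```

For example, `missing_commands lsid bhosts lsclusters` prints the names of any of those three commands that are not on the `PATH` of the shell it runs in. Running it as the `dd-agent` user gives a closer approximation of what the Agent sees.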
26+
27+
To get the environment variables necessary for the Agent service, locate the `<LSF_TOP_DIR>/conf/profile.lsf` file and run the following command:
28+
29+
`env -i bash -c "source <LSF_TOP_DIR>/conf/profile.lsf; env"`
30+
31+
This will output a list of environment variables necessary to run the IBM Spectrum LSF commands.
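
The full output usually also contains shell bookkeeping variables such as `PWD` and `SHLVL`. As a sketch, the output can be narrowed to the kinds of variables shown in the examples in this README. The helper name `lsf_env_from_profile` and the filter pattern (`LSF_*`, `PATH`, `LD_LIBRARY_PATH`) are assumptions for illustration; your `profile.lsf` may export other variables that the LSF commands need, so review the unfiltered output as well.

```shell
# Hypothetical helper: print the LSF-related variables exported by a
# profile script, one KEY=value pair per line.
# $1 is the path to the profile, for example <LSF_TOP_DIR>/conf/profile.lsf.
lsf_env_from_profile() {
  # Source the profile in a clean environment, then dump the result and
  # keep only LSF_* variables, PATH, and LD_LIBRARY_PATH.
  env -i bash -c '. "$1" >/dev/null 2>&1; env' _ "$1" |
    grep -E '^(LSF_[A-Za-z0-9_]*|PATH|LD_LIBRARY_PATH)='
}
```

Calling `lsf_env_from_profile <LSF_TOP_DIR>/conf/profile.lsf` prints lines in the `KEY=value` form, which can be reviewed before they are copied into the service configuration.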
32+
33+
##### For Operating Systems using systemd
34+
35+
Add these environment variables to the file `/etc/datadog-agent/environment`; here is an example:
36+
37+
```
38+
LSF_SERVERDIR=<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/etc
39+
LSF_ENVDIR=<LSF_TOP_DIR>/conf
40+
LSF_BINDIR=<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/bin
41+
LSF_LIBDIR=<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/lib
42+
PATH=<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/etc:<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/bin:/usr/local/bin:/usr/bin:/bin:.
43+
LD_LIBRARY_PATH=<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/lib
44+
```
45+
Restart the agent.
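
After the restart, one way to confirm that the running Agent process received the variables is to read its environment from `/proc`. This is a Linux-only sketch: `lsf_env_of_pid` is a hypothetical helper name, the filter pattern is an assumption matching the example variables above, and reading another user's `/proc/<PID>/environ` typically requires root.

```shell
# Hypothetical helper: list LSF-related variables in the environment of
# a running process. $1 is the process ID. Linux only (reads /proc).
lsf_env_of_pid() {
  # /proc/<PID>/environ holds NUL-separated KEY=value pairs.
  tr '\0' '\n' < "/proc/$1/environ" |
    grep -E '^(LSF_[A-Za-z0-9_]*|PATH|LD_LIBRARY_PATH)='
}
```

On systemd hosts, the Agent's main PID can usually be obtained with `systemctl show --property MainPID --value datadog-agent` and passed to the helper. If nothing is printed, the service did not pick up the variables.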
46+
47+
View more information about setting environment variables for the Datadog Agent [here][11].
48+
49+
##### For Operating Systems using upstart
50+
51+
After getting the list of necessary environment variables, add them to the `/etc/init/datadog-agent.conf` file.
52+
53+
Example of the configuration for upstart:
54+
55+
```
56+
description "Datadog Agent"
57+
58+
start on started networking
59+
stop on runlevel [!2345]
60+
61+
respawn
62+
respawn limit 10 5
63+
normal exit 0
64+
65+
console log
66+
env DD_LOG_TO_CONSOLE=false
67+
env LSF_SERVERDIR=<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/etc
68+
env LSF_ENVDIR=<LSF_TOP_DIR>/conf
69+
env LSF_BINDIR=<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/bin
70+
env LSF_LIBDIR=<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/lib
71+
env PATH=<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/etc:<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/bin:/usr/local/bin:/usr/bin:/bin:.
72+
env LD_LIBRARY_PATH=<LSF_TOP_DIR>/10.1/linux3.10-glibc2.17-x86_64/lib
73+
74+
setuid dd-agent
75+
76+
script
77+
exec /opt/datadog-agent/bin/agent/agent start -p /opt/datadog-agent/run/agent.pid
78+
end script
79+
post-stop script
80+
rm -f /opt/datadog-agent/run/agent.pid
81+
end script
82+
```
83+
Restart the agent.
84+
85+
Each time there is an Agent update, `/etc/init/datadog-agent.conf` is wiped and needs to be updated again.
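
Because the upstart file has to be edited again after each update, it may be convenient to keep the variables in a separate `KEY=value` file and regenerate the `env` stanzas from it. The sketch below assumes such a file exists and that values contain no whitespace (values with spaces would need quoting in the upstart file); `to_upstart_env` is a hypothetical helper name.

```shell
# Hypothetical helper: read KEY=value lines on stdin and print them as
# upstart "env KEY=value" stanzas, skipping empty lines.
to_upstart_env() {
  sed -e '/^$/d' -e 's/^/env /'
}
```

For example, `to_upstart_env < lsf.env` (where `lsf.env` is an assumed file name) prints lines such as `env LSF_ENVDIR=<LSF_TOP_DIR>/conf`, ready to paste into `/etc/init/datadog-agent.conf`.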
2086

2187
### Configuration
2288

2389
1. Edit the `ibm_spectrum_lsf.d/conf.yaml` file, in the `conf.d/` folder at the root of your Agent's configuration directory to start collecting your `ibm_spectrum_lsf` performance data. See the [sample ibm_spectrum_lsf.d/conf.yaml][4] for all available configuration options.
2490

91+
The IBM Spectrum LSF integration runs a series of management commands to collect data. To control which commands are run and which metrics are emitted, use the `metric_sources` configuration option. By default, data is collected from the following commands: `lsclusters`, `lshosts`, `bhosts`, `lsload`, `bqueues`, `bslots`, and `bjobs`. You can enable additional optional metric sources or opt out of collecting any set of metrics.
92+
93+
For example, to collect only GPU-specific metrics, set your metric sources as follows:
94+
```
95+
metric_sources:
96+
- lsload_gpu
97+
- bhosts_gpu
98+
```
99+
100+
The `badmin_perfmon` metric source collects data from the `badmin perfmon view -json` command. This collects [overall statistics][12] about the cluster. To collect these metrics, performance collection must be enabled on your server using the `badmin perfmon start <COLLECTION_INTERVAL>` command. By default, the integration runs this command automatically (and stops collection once the Agent is turned off). You can turn off this behavior by setting `badmin_perfmon_auto: false`.
101+
102+
Because collecting these metrics can add extra load on your server, we recommend setting a higher collection interval for them, of at least 60. The exact value depends on the load and size of your cluster. View IBM Spectrum LSF's [recommendations][13] for managing high query load.
103+
104+
Similarly, the `bhist` command collects information about completed jobs and can be query intensive, so we recommend setting `min_collection_interval` to 60 when monitoring it.
105+
106+
Here is a sample configuration monitoring all available metrics:
107+
108+
```
109+
instances:
110+
- cluster_name: test-cluster
111+
metric_sources:
112+
- lsclusters
113+
- lshosts
114+
- bhosts
115+
- lsload
116+
- bqueues
117+
- bslots
118+
- bjobs
119+
- lsload_gpu
120+
- bhosts_gpu
121+
- cluster_name: test-cluster
122+
badmin_perfmon_auto: false
123+
metric_sources:
124+
- badmin_perfmon
125+
- bhist
126+
min_collection_interval: 60
127+
```
128+
25129
2. [Restart the Agent][5].
26130

27131
### Validation
28132

29133
[Run the Agent's status subcommand][6] and look for `ibm_spectrum_lsf` under the Checks section.
30134

135+
31136
## Data Collected
32137

33138
### Metrics
@@ -42,14 +147,16 @@ The IBM Spectrum LSF integration does not include any events.
42147

43148
The IBM Spectrum LSF integration does not include any service checks.
44149

45-
See [service_checks.json][8] for a list of service checks provided by this integration.
46-
47150
## Troubleshooting
48151

152+
Use the `datadog-agent check` command to view the metrics the integration is collecting, as well as debug logs from the check:
153+
154+
`sudo -u dd-agent bash -c "source /usr/share/lsf/conf/profile.lsf && datadog-agent check ibm_spectrum_lsf -l debug"`
155+
49156
Need help? Contact [Datadog support][9].
50157

51158

52-
[1]: **LINK_TO_INTEGRATION_SITE**
159+
[1]: https://www.ibm.com/products/hpc-workload-management
53160
[2]: https://app.datadoghq.com/account/settings/agent/latest
54161
[3]: https://docs.datadoghq.com/containers/kubernetes/integrations/
55162
[4]: https://github.com/DataDog/integrations-core/blob/master/ibm_spectrum_lsf/datadog_checks/ibm_spectrum_lsf/data/conf.yaml.example
@@ -58,3 +165,7 @@ Need help? Contact [Datadog support][9].
58165
[7]: https://github.com/DataDog/integrations-core/blob/master/ibm_spectrum_lsf/metadata.csv
59166
[8]: https://github.com/DataDog/integrations-core/blob/master/ibm_spectrum_lsf/assets/service_checks.json
60167
[9]: https://docs.datadoghq.com/help/
168+
[10]: https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=cluster-adding-administrators
169+
[11]: https://docs.datadoghq.com/agent/guide/environment-variables/#using-environment-variables-in-systemd-units
170+
[12]: https://www.ibm.com/docs/ru/spectrum-lsf/10.1.0?topic=performance-monitor-metrics-in-real-time
171+
[13]: https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=tips-maintaining-cluster-performance
