# hobbit-clients.cfg - configuration file for clients reporting to Xymon
#
# This file is used by the hobbitd_client module when it builds the
# cpu, disk, files, memory, msgs, ports and procs status messages from
# the information reported by clients running on the monitored systems.
#
# This file must be installed on the Xymon server - client installations
# do not need this file.
#
# The file defines a series of rules:
#    UP     : Changes the "cpu" status when the system has rebooted recently,
#             or when it has been running for too long.
#    LOAD   : Changes the "cpu" status according to the system load.
#    CLOCK  : Changes the "cpu" status if the client system clock is
#             not synchronized with the clock of the Xymon server.
#    DISK   : Changes the "disk" status, depending on the amount of space
#             used on filesystems.
#    MEMPHYS: Changes the "memory" status, based on the percentage of real
#             memory used.
#    MEMACT : Changes the "memory" status, based on the percentage of "actual"
#             memory used. Note: Not all systems report an "actual" value.
#    MEMSWAP: Changes the "memory" status, based on the percentage of swap
#             space used.
#    PROC   : Changes the "procs" status according to which processes were found
#             in the "ps" listing from the client.
#    LOG    : Changes the "msgs" status according to entries in text-based logfiles.
#             Note: The "client-local.cfg" file controls which logfiles the client will report.
#    FILE   : Changes the "files" status according to meta-data for files.
#             Note: The "client-local.cfg" file controls which files the client will report.
#    DIR    : Changes the "files" status according to the size of a directory.
#             Note: The "client-local.cfg" file controls which directories the client will report.
#    PORT   : Changes the "ports" status according to which tcp ports were found
#             in the "netstat" listing from the client.
#    DEFAULT: Set the default values that apply if no other rules match.
#
# All rules can be qualified so they apply only to certain hosts, or at certain
# times of the day (see below).
#
# Each type of rule takes a number of parameters:
#    UP bootlimit toolonglimit
#       The cpu status goes yellow if the system has been up for less than
#       "bootlimit" time, or longer than "toolonglimit". The time is in
#       minutes, or you can add h/d/w for hours/days/weeks - e.g. "2h" for
#       two hours, or "4w" for 4 weeks.
#       Defaults: bootlimit=1h, toolonglimit=-1 (infinite).
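#       Example (a sketch, values chosen for illustration only): Go yellow
#       if the system has been up for less than 2 hours or more than 4 weeks:
#          UP 2h 4w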
#
#    LOAD warnlevel paniclevel
#       If the system load exceeds "warnlevel" or "paniclevel", the "cpu"
#       status will go yellow or red, respectively. These are decimal
#       numbers.
#       Defaults: warnlevel=5.0, paniclevel=10.0
#
#    CLOCK maximum-offset
#       If the system clock of the client differs from that of the Xymon
#       server by more than "maximum-offset" seconds, then the CPU status
#       column will go yellow. Note that the accuracy of this test is limited,
#       since it is affected by the time it takes a client status report to
#       go from the client to the Xymon server and be processed. You should
#       therefore allow for a few seconds (5-10) of slack when you define
#       your max. offset.
#       It is not wise to use this test, unless your servers are synchronized
#       to a common clock, e.g. through NTP.
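#       Example (a sketch, assuming NTP-synchronized hosts; the value is
#       illustrative): Go yellow if the client clock is more than 30 seconds
#       off, which leaves room for the 5-10 seconds of reporting delay:
#          CLOCK 30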
#
#    DISK filesystem warnlevel paniclevel
#    DISK filesystem IGNORE
#       If the utilization of "filesystem" is reported to exceed "warnlevel"
#       or "paniclevel", the "disk" status will go yellow or red, respectively.
#       "warnlevel" and "paniclevel" are either the percentage used, or the
#       space available as reported by the local "df" command on the host.
#       For the latter type of check, the "warnlevel" must be followed by the
#       letter "U", e.g. "1024U".
#       The special keyword "IGNORE" causes this filesystem to be ignored
#       completely, i.e. it will not appear in the "disk" status column and
#       it will not be tracked in a graph. This is useful for e.g. removable
#       devices, backup-disks and similar hardware.
#       "filesystem" is the mount-point where the filesystem is mounted, e.g.
#       "/usr" or "/home". A filesystem-name that begins with "%" is interpreted
#       as a Perl-compatible regular expression; e.g. "%^/oracle.*/" will match
#       any filesystem whose mountpoint begins with "/oracle".
#       Defaults: warnlevel=90%, paniclevel=95%
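#       Examples (sketches; the mount points and limits are illustrative).
#       Go yellow at 95% used and red at 98% used on /var:
#          DISK /var 95 98
#       Go yellow when less than 1024 units are free on /home and red below
#       512 units, in whatever unit the client's "df" reports. This assumes
#       that "paniclevel" accepts the same "U" suffix as "warnlevel", and
#       that a "U" limit triggers when the free space drops below it:
#          DISK /home 1024U 512U
#       Ignore every filesystem mounted below /media:
#          DISK %^/media/ IGNORE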
#
#    MEMPHYS warnlevel paniclevel
#    MEMACT warnlevel paniclevel
#    MEMSWAP warnlevel paniclevel
#       If the memory utilization exceeds the "warnlevel" or "paniclevel", the
#       "memory" status will change to yellow or red, respectively.
#       Note: The words "PHYS", "ACT" and "SWAP" are also recognized.
#       Defaults: MEMPHYS warnlevel=100 paniclevel=101 (i.e. it will never go red)
#                 MEMSWAP warnlevel=50 paniclevel=80
#                 MEMACT  warnlevel=90 paniclevel=97
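#       Example (a sketch; the limits are illustrative): Go yellow when swap
#       usage exceeds 70% and red when it exceeds 90%:
#          MEMSWAP 70 90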
#
#    PROC processname minimumcount maximumcount color [TRACK=id] [TEXT=displaytext]
#       The "ps" listing sent by the client will be scanned for how many
#       processes containing "processname" are running, and this is then
#       matched against the min/max settings defined here. If the running
#       count is outside the thresholds, the color of the "procs" status
#       changes to "color".
#       To check for a process that must NOT be running: Set minimum and
#       maximum to 0.
#
#       "processname" can be a simple string, in which case this string must
#       show up in the "ps" listing as a command. The scanner will find
#       a ps-listing of e.g. "/usr/sbin/cron" if you only specify "processname"
#       as "cron".
#       "processname" can also be a Perl-compatible regular expression, e.g.
#       "%java.*inst[0123]" can be used to find entries in the ps-listing for
#       "java -Xmx512m inst2" and "java -Xmx256 inst3". In that case,
#       "processname" must begin with "%" followed by the reg.expression.
#       If "processname" contains whitespace (blanks or TAB), you must enclose
#       the full string in double quotes - including the "%" if you use regular
#       expression matching. E.g.
#          PROC "%hobbitd_channel --channel=data.*hobbitd_rrd" 1 1 yellow
#       or
#          PROC "java -DCLASSPATH=/opt/java/lib" 2 5
#
#       You can have multiple "PROC" entries for the same host; all of the
#       checks are merged into the "procs" status and the most severe
#       check defines the color of the status.
#
#       The TRACK=id option causes the number of processes found to be recorded
#       in an RRD file, with "id" as part of the filename. This graph will then
#       appear on the "procs" page as well as on the "trends" page. Note that
#       "id" must be unique among the processes tracked for each host.
#
#       The TEXT=displaytext option affects how the process appears on the
#       "procs" status page. By default, the process is listed with the
#       "processname" as identification, but if this is a regular expression
#       it may be a bit difficult to understand. You can then use e.g.
#       "TEXT=Apache" to make these processes appear with the name "Apache"
#       instead.
#
#       Defaults: mincount=1, maxcount=-1 (unlimited), color="red".
#       Note: No processes are checked by default.
#
#       Example: Check that "cron" is running:
#          PROC cron
#       Example: Check that at least 5 "httpd" processes are running, but
#       not more than 20:
#          PROC httpd 5 20
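#       Example (a sketch; "telnetd" is just an illustrative process name):
#       Go red if any "telnetd" process is running:
#          PROC telnetd 0 0 red
#       Example (a sketch; the id and display text are illustrative): Match
#       "httpd" processes with a regular expression, graph the count and
#       show the check as "Apache" on the "procs" page:
#          PROC %httpd 5 20 yellow TRACK=httpd TEXT=Apache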
#
#    LOG filename match-pattern [COLOR=color] [IGNORE=ignore-pattern] [TEXT=displaytext]
#       In the "client-local.cfg" file, you can list any number of files
#       that the client will collect log data from. These are sent to the
#       Xymon server together with the other client data, and you can then
#       choose how to analyze the log data with LOG entries.
#
#       ************ IMPORTANT ***************
#       To monitor a logfile, you *MUST* configure both client-local.cfg
#       and hobbit-clients.cfg. If you configure only the client-local.cfg
#       file, the client will collect the log data and you can view it in
#       the "client data" display, but it will not affect the color of the
#       "msgs" status. On the other hand, if you configure only the
#       hobbit-clients.cfg file, then there will be no log data to inspect,
#       and you will not see any updates of the "msgs" status either.
#
#       "filename" is a filename or pattern. The set of files reported by
#       the client is matched against "filename", and if they match then
#       this LOG entry is processed against the data from that file.
#
#       "match-pattern": The log data is matched against this pattern. If
#       there is a match, this log file causes a status change to "color".
#
#       "ignore-pattern": The log data that matched "match-pattern" is also
#       matched against "ignore-pattern". If the data matches the "ignore-pattern",
#       this line of data does not affect the status color. In other words,
#       the "ignore-pattern" can be used to refine the strings which cause
#       a match.
#       Note: The "ignore-pattern" is optional.
#
#       "color": The color which this match will trigger.
#       Note: "color" is optional; if omitted, "red" will be used.
#
#       Example: Go yellow if the text "WARNING" shows up in any logfile.
#          LOG %.* WARNING COLOR=yellow
#
#       Example: Go red if the text "I/O error" or "read error" appears.
#          LOG %/var/(adm|log)/messages %(I/O|read).error COLOR=red
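#
#       Example (a sketch, assuming "ignore-pattern" accepts a plain string
#       just as "match-pattern" does; the strings are illustrative): Go
#       yellow on "error" lines in /var/log/messages, except those that
#       mention "cdrom":
#          LOG /var/log/messages error COLOR=yellow IGNORE=cdrom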
#
#    FILE filename [color] [things to check] [TRACK]
#       NB: The files you wish to monitor must be listed in a "file:..."
#       entry in the client-local.cfg file, in order for the client to
#       report any data about them.
#
#       "filename" is a filename or pattern. The set of files reported by
#       the client is matched against "filename", and if they match then
#       this FILE entry is processed against the data from that file.
#
#       [things to check] can be one or more of the following:
#       - "NOEXIST" triggers a warning if the file exists. By default,
#         a warning is triggered for files that have a FILE entry, but
#         which do not exist.
#       - "TYPE=type" where "type" is one of "file", "dir", "char", "block",
#         "fifo", or "socket". Triggers a warning if the file is not of the
#         specified type.
#       - "OWNERID=owner" and "GROUPID=group" trigger a warning if the owner
#         or group does not match what is listed here. "owner" and "group" are
#         specified either with the numeric uid/gid, or the user/group name.
#       - "MODE=mode" triggers a warning if the file permissions are not
#         as listed. "mode" is written in the standard octal notation, e.g.
#         "644" for the rw-r--r-- permissions.
#       - "SIZE<max.size" and "SIZE>min.size" trigger a warning if the file
#         size is greater than "max.size" or less than "min.size", respectively.
#         You can append "K" (KB), "M" (MB), "G" (GB) or "T" (TB) to the size.
#         If there is no such modifier, KB is assumed.
#         E.g. to warn if a file grows larger than 1MB (1024 KB): "SIZE<1M".
#       - "SIZE=size" triggers a warning if the file size is not what is listed.
#       - "MTIME>min.mtime" and "MTIME<max.mtime" check how long ago the file
#         was last modified (in seconds). E.g. to check if a file was updated
#         within the past 10 minutes (600 seconds): "MTIME<600". Or to check
#         that a file has NOT been updated in the past 24 hours: "MTIME>86400".
#       - "MTIME=timestamp" checks if a file was last modified at "timestamp".
#         "timestamp" is a unix epoch time (seconds since midnight Jan 1 1970 UTC).
#       - "CTIME>min.ctime", "CTIME<max.ctime", "CTIME=timestamp" act like the
#         mtime checks, but for the ctime timestamp (when the file's directory
#         entry was last changed, e.g. by chown, chgrp or chmod).
#       - "MD5=md5sum", "SHA1=sha1sum", "RMD160=rmd160sum" trigger a warning
#         if the file checksum using the MD5, SHA1 or RMD160 message digest
#         algorithm does not match the one configured here. Note: The "file"
#         entry in the client-local.cfg file must specify which algorithm to use.
#
#       "TRACK" causes the size of this file to be tracked in an RRD file, and
#       shown on the graph on the "files" display.
#
#       Example: Check that the /var/log/messages file is not empty and was updated
#       within the past 10 minutes, and go yellow if either fails:
#          FILE /var/log/messages SIZE>0 MTIME<600 yellow
#
#       Example: Check the timestamp, size and SHA-1 hash of the /bin/sh program:
#          FILE /bin/sh MTIME=1128514608 SIZE=645140 SHA1=5bd81afecf0eb93849a2fd9df54e8bcbe3fefd72
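#
#       Example (a sketch; the paths, mode and owner are illustrative): Go
#       yellow if /etc/nologin exists, and go red if /etc/shadow is not a
#       regular file owned by root with mode 600:
#          FILE /etc/nologin NOEXIST yellow
#          FILE /etc/shadow red TYPE=file OWNERID=root MODE=600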
#
#    DIR directory [color] [SIZE<maxsize] [SIZE>minsize] [TRACK]
#       NB: The directories you wish to monitor must be listed in a "dir:..."
#       entry in the client-local.cfg file, in order for the client to
#       report any data about them.
#
#       "directory" is a filename or pattern. The set of directories reported by
#       the client is matched against "directory", and if they match then
#       this DIR entry is processed against the data for that directory.
#
#       "SIZE<maxsize" and "SIZE>minsize" define the size limits that the
#       directory must stay within. If it goes outside these limits, a warning
#       will trigger. Note that Xymon uses the raw number reported by the
#       local "du" command on the client. This is commonly KB, but it may be
#       disk blocks, which are often 512 bytes.
#
#       "TRACK" causes the size of this directory to be tracked in an RRD file,
#       and shown on the graph on the "files" display.
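#
#       Example (a sketch; the directory and limit are illustrative): Go
#       yellow if /var/spool/mqueue grows beyond 100000 units as reported by
#       "du" on the client, and graph its size:
#          DIR /var/spool/mqueue yellow SIZE<100000 TRACK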
#
#    PORT [LOCAL=addr] [EXLOCAL=addr] [REMOTE=addr] [EXREMOTE=addr] [STATE=state] [EXSTATE=state] [MIN=mincount] [MAX=maxcount] [COLOR=color] [TRACK=id] [TEXT=displaytext]
#       The "netstat" listing sent by the client will be scanned for how many
#       sockets match the criteria listed.
#       "addr" is a (partial) address specification in the format used on
#       the output from netstat. This is typically "10.0.0.1:80" for the IP
#       10.0.0.1, port 80. Or "*:80" for any local address, port 80.
#       NB: The Xymon clients normally report only the numeric data for
#       IP-addresses and port-numbers, so you must specify the port
#       number (e.g. "80") instead of the service name ("www").
#       "state" causes only the sockets in the specified state to be included;
#       it is usually LISTEN or ESTABLISHED.
#       The socket count is then matched against the min/max settings defined
#       here. If the count is outside the thresholds, the color of the "ports"
#       status changes to "color".
#       To check for a socket that must NOT exist: Set minimum and
#       maximum to 0.
#
#       "addr" and "state" can be simple strings, in which case these strings must
#       show up in the "netstat" listing in the appropriate column.
#       "addr" and "state" can also be Perl-compatible regular expressions, e.g.
#       "LOCAL=%(:80|:443)" can be used to find entries in the netstat local port for
#       both http (port 80) and https (port 443). In that case, "addr" or "state" must
#       begin with "%" followed by the reg.expression.
#
#       The TRACK=id option causes the number of sockets found to be recorded
#       in an RRD file, with "id" as part of the filename. This graph will then
#       appear on the "ports" page as well as on the "trends" page. Note that
#       "id" must be unique among the ports tracked for each host.
#
#       The TEXT=displaytext option affects how the port appears on the
#       "ports" status page. By default, the port is listed with the
#       local/remote/state rules as identification, but this may be somewhat
#       difficult to understand. You can then use e.g. "TEXT=Secure Shell" to make
#       these ports appear with the name "Secure Shell" instead.
#
#       Defaults: state="LISTEN", mincount=1, maxcount=-1 (unlimited), color="red".
#       Note: No ports are checked by default.
#
#       Example: Check that there is someone listening on the https port:
#          PORT "LOCAL=%([.:]443)$" state=LISTEN TEXT=https
#
#       Example: Check that at least 5 "ssh" connections are established, but
#       not more than 20; warn but do not error; graph the connection count:
#          PORT "LOCAL=%([.:]22)$" state=ESTABLISHED min=5 max=20 color=yellow TRACK=ssh "TEXT=SSH logins"
#
#       Example: Check that ONLY ports 22, 80 and 443 are open for incoming connections:
#          PORT STATE=LISTEN LOCAL=%0.0.0.0[.:].* EXLOCAL=%[.:](22|80|443)$ MAX=0 "TEXT=Bad listeners"
#
#
# To apply rules to specific hosts, you can use the "HOST=", "EXHOST=", "PAGE=",
# "EXPAGE=", "CLASS=" or "EXCLASS=" qualifiers. (These act just as in the
# hobbit-alerts.cfg file).
#
# Hostnames are either a comma-separated list of hostnames (from the bb-hosts file),
# "*" to indicate "all hosts", or a Perl-compatible regular expression.
# E.g. "HOST=dns.foo.com,www.foo.com" identifies two specific hosts;
# "HOST=%www.*.foo.com EXHOST=www-test.foo.com" matches all hosts with a name
# beginning with "www", except the "www-test" host.
# "PAGE" and "EXPAGE" match the hostnames against the page on which they are
# located in the bb-hosts file, via the bb-hosts' page/subpage/subparent
# directives. This can be convenient to pick out all hosts on a specific page.
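# Example (a sketch; the "webservers" page name is hypothetical and is
# assumed to match a page defined in bb-hosts): Apply a process check to
# all hosts on that page except the "www-test" host:
#    PROC httpd 5 20 PAGE=webservers EXHOST=www-test.foo.com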
#
# Rules can be dependent on time-of-day, using the standard Xymon syntax
# (see the bb-hosts(5) man page about the NKTIME parameter). E.g. "TIME=W:0800:2200"
# applied to a rule will make this rule active only on week-days between
# 8AM and 10PM.
#
# You can also associate a GROUP id with a rule. The group-id is passed to
# the alert module, which can then use it to control who gets an alert when
# a failure occurs. E.g. the following associates the "httpd" process check
# with the "web" group, and the "sshd" check with the "admins" group:
#    PROC httpd 5 GROUP=web
#    PROC sshd 1 GROUP=admins
# In the hobbit-alerts.cfg file, you could then have rules like
#    GROUP=web
#       MAIL webmaster@foo.com
#    GROUP=admins
#       MAIL root@foo.com
#
# Qualifiers must be placed after each rule, e.g.
#    LOAD 8.0 12.0 HOST=db.foo.com TIME=*:0800:1600
#
# If you have multiple rules that you want to apply the same qualifiers to,
# you can write the qualifiers *only* on one line, followed by the rules. E.g.
#    HOST=%db.*.foo.com TIME=W:0800:1600
#       LOAD 8.0 12.0
#       DISK /db 98 100
#       PROC mysqld 1
# will apply the three rules to all of the "db" hosts on week-days between 8AM
# and 4PM. This can be combined with per-rule qualifiers, in which case the
# per-rule qualifier overrides the general qualifier; e.g.
#    HOST=%.*.foo.com
#       LOAD 7.0 12.0 HOST=bax.foo.com
#       LOAD 3.0 8.0
# will result in the load-limits being 7.0/12.0 for the "bax.foo.com" host,
# and 3.0/8.0 for all other foo.com hosts.
#
# The special DEFAULT section can modify the built-in defaults - this must
# be placed at the end of the file.

HOST=rabbit.<%= domain %>
DISK %.*stage2$ IGNORE

# jonund has 24 cores and we try to utilise it as much as possible;
# a load average of up to 1.5*cores is probably not problematic
HOST=jonund.<%= domain %>
LOAD 36.0 48.0

# ecosse has 24 cores, is a builder, and we try to use them all
HOST=ecosse.<%= domain %>
LOAD 36.0 48.0

# rabbit has 12 cores and mksquashfs uses all of them
HOST=rabbit.<%= domain %>
LOAD 18.0 24.0

# sucuk has 12 cores and autobuilder uses all of them
HOST=sucuk.<%= domain %>
LOAD 18.0 24.0

DEFAULT
# These are the built-in defaults.
UP 1h
LOAD 5.0 10.0
DISK %^/mnt/cdrom 101 101
DISK * 90 95
MEMPHYS 100 101
MEMSWAP 50 80
MEMACT 90 97
CLOCK 60
FILE /var/lib/puppet/state/state.yaml yellow mtime<5400
PORT state=LISTEN "LOCAL=%([.:]22)$" MIN=1 TEXT=ssh
PROC puppetd 0 3 red
# Allow up to 10 crond processes, just in case something goes wrong
PROC crond 1 10 red