- .sp 0.5i
- .ce 2
- Keeping watch over the flocks
- at night (and day)
- .sp 0.3i
- .ce 8
- Kenneth Ingham
- University of New Mexico Computing Center
- Distributed Systems Group
- 2701 Campus NE
- Albuquerque, NM 87131
- (505) 277-8044
- ingham@charon.unm.edu
- ucbvax!unmvax!charon!ingham
- .sp 0.2i
- .ce
- Topic Areas: Applications, System management, Utilities
- .sp 0.5i
- The computing facilities offered by the University of New Mexico
- Computing Center include three microvaxen, five large vaxen (780 or
- bigger), and a Sequent B8000. In addition to these Unix/VMS machines,
- the UNMCC Distributed Systems Group (DSG) monitors a number of the
- various microvaxen and Sun workstations scattered across campus. This
- duty falls to the DSG programmer designated as "DOC", or "DSG On Call",
- who carries the beeper according to a monthly rotation schedule.
- .sp
- In the past, shell scripts running every six hours reported various
- system statistics to DOC, who then scanned the output for signs of
- possible trouble. As the number of machines and the number of
- potential problems grew, the mound of output that DOC had to process,
- most of which merely indicated normal system operation, became
- overwhelming. With several machines to monitor and only one
- person acting in this capacity, DOC can waste a tremendous amount
- of time wading through system status reports, time that would be better
- spent actually fixing system problems.
- .sp
- In response to this situation, the author developed a tool that
- introduces some intelligence into the machine's self-reporting, letting
- the machine filter out messages indicating normal operation and
- forward to DOC only those messages that point out trouble areas.
- The result of these efforts is Watcher, a very general and extensible
- system self-monitor. Running more often than the set of
- shell scripts, Watcher keeps closer tabs on the system; since it
- delivers only a summary of potential problems, however, this extra
- monitoring produces \fIno\fR corresponding increase in the demand on
- the system manager. Problems are less likely to slip by unnoticed in the
- more concise output, leading to an improvement in overall system
- availability as well as more effective use of the system manager's time.
- .sp
- Watcher was designed to be almost as flexible as DOC in deciding what
- constitutes a problem with the system. Running at intervals specified
- in crontab, Watcher issues a number of
- user-specified commands (each of which
- delivers its output in a different format), parsing all or part of the
- output from either the left or the right. It compares the parsed output
- to the last such output obtained, checking for indications
- of a system abnormality. Such signs might take the form of an
- overly abrupt change in a certain value (e.g. a process that suddenly
- begins gobbling vast amounts of CPU time),
- a value that rises above the allowable maximum or falls below the
- allowable minimum (such as an overly full file system),
- or an unacceptable change in a string value
- (e.g. when "up" changes to "down"). For commands such as
- "ps", whose output varies considerably with each run, specific
- parts of the output can be designated as a key; successive runs of
- Watcher will home in on these key areas for their comparisons.
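- .sp
- This abstract does not give Watcher's configuration syntax or its
- implementation, so the following Python sketch is only an illustration of
- the kinds of checks described above: picking a field from the left or the
- right, testing a value against limits, testing how far it moved since the
- previous run, and matching lines between runs by a key. Every name in it
- (pick_field, Rule, check_number, check_string, index_by_key, compare_runs)
- and the sample data are assumptions made for the example, not part of Watcher.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


def pick_field(line: str, index: int, from_right: bool = False) -> str:
    """Return one whitespace-separated field, counted from the left or the right."""
    fields = line.split()
    return fields[-1 - index] if from_right else fields[index]


@dataclass
class Rule:
    """Hypothetical limits for one numeric value; None disables that check."""
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    max_change: Optional[float] = None  # largest allowed change between two runs


def check_number(name: str, previous: Optional[float], current: float,
                 rule: Rule) -> List[str]:
    """Report a value outside its limits or one that moved too far since the last run."""
    problems = []
    if rule.minimum is not None and current < rule.minimum:
        problems.append(f"{name}: {current:g} is below the minimum {rule.minimum:g}")
    if rule.maximum is not None and current > rule.maximum:
        problems.append(f"{name}: {current:g} is above the maximum {rule.maximum:g}")
    if (rule.max_change is not None and previous is not None
            and abs(current - previous) > rule.max_change):
        problems.append(f"{name}: changed from {previous:g} to {current:g}")
    return problems


def check_string(name: str, previous: Optional[str], current: str) -> List[str]:
    """Report a changed string value (simplification: any change counts as a problem)."""
    if previous is not None and current != previous:
        return [f"{name}: changed from {previous!r} to {current!r}"]
    return []


def index_by_key(output: str, key_field: int) -> Dict[str, str]:
    """Map each non-blank line to its key field so line order does not matter."""
    return {pick_field(line, key_field): line
            for line in output.splitlines() if line.strip()}


def compare_runs(previous_output: str, current_output: str, key_field: int,
                 value_field: int, rule: Rule) -> List[str]:
    """Match the lines of two runs by key and check one numeric field on each."""
    old = index_by_key(previous_output, key_field)
    problems = []
    for key, line in index_by_key(current_output, key_field).items():
        current = float(pick_field(line, value_field))
        previous = float(pick_field(old[key], value_field)) if key in old else None
        problems.extend(check_number(key, previous, current, rule))
    return problems


if __name__ == "__main__":
    # Made-up ps-like output: process id, CPU seconds, command.
    last_run = "101 12 sendmail\n202 30 cron\n"
    this_run = "202 31 cron\n101 400 sendmail\n303 1 sh\n"
    for problem in compare_runs(last_run, this_run, 0, 1, Rule(max_change=60)):
        print(problem)
```

- In this sketch a run that finds nothing wrong prints nothing, which mirrors
- the goal of forwarding only trouble reports. Keying on the process id lets
- the two runs be compared even though the lines appear in a different order
- and a new process has started in between.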
- .sp
- Since the user specifies not only the commands Watcher will execute and
- the time lapse between successive runs, but also the aforementioned
- parameters that indicate system anomalies, Watcher can easily be seen
- as a very flexible, general system monitor. Its use at UNM has provided
- a marked increase in the productivity of the system manager, which has
- led in turn to an increase in the reliability and availability of the
- systems at UNMCC.
-