Monitoring and Testing

The Dryad production system is monitored by several services, each of which tests a different aspect of the system.

Nagios
Nagios runs on NESCent systems to verify that major Dryad functionality is in place. When it detects an error, it sends emails and text messages to the appropriate personnel.

Nagios performs these checks:
 * the machine is running
 * the Dryad home page responds
 * searches return results
 * the frequency of error messages in the Dryad log files
 * the number of processes on the host is below a user-defined threshold
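
The threshold-style checks above (error-message frequency, process count) can be illustrated with a minimal sketch of a Nagios-style check written in Python. This is not Dryad's actual plugin code: the function names, the "ERROR" log marker, and the warning/critical values are assumptions chosen for illustration. The only part taken from Nagios itself is the standard plugin exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a Nagios status; higher values are worse."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def count_error_lines(lines, marker="ERROR"):
    """Count log lines containing the marker (the marker is an assumed log format)."""
    return sum(1 for line in lines if marker in line)


def report(name, value, warn, crit):
    """Return the exit code and the one-line status message a plugin would print."""
    status = evaluate(value, warn, crit)
    message = f"{name} {LABELS[status]} - value={value} (warn={warn}, crit={crit})"
    return status, message


def main():
    # Hypothetical sample lines standing in for a real Dryad log file.
    sample_log = ["INFO start", "ERROR db timeout", "ERROR db timeout", "INFO done"]
    status, message = report("LOG_ERRORS", count_error_lines(sample_log), warn=2, crit=5)
    print(message)
    return status  # a real plugin would pass this to sys.exit()
```

The same `evaluate` function would serve a process-count check: pass the current number of processes as `value` and the user-defined threshold as `crit`.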

Local configuration parameters for Nagios are in /etc/nagios/nrpe.cfg.

All Dryad-related Nagios checks (password protected)

Nagios also performs very high-level tests of the non-production systems.

Hyperic HQ
NCSU runs Hyperic HQ on the production server to monitor its internal status (memory, CPU usage, etc.).

The Hyperic system is at https://spectre.lib.ncsu.edu/ (password protected).

DNS failover
Dryad's DNS entries are managed by MCNC. When an issue is detected (i.e., when the home page doesn't respond), all traffic is rerouted to a secondary server. For details, see the Failover page.
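
The failover decision can be sketched in Python as follows. This is only an illustration of the idea, not MCNC's mechanism: the page does not say how the outage is detected or how many failed probes trigger the switch, so the `max_failures` threshold, the function names, and the example hostnames are all assumptions.

```python
import urllib.error
import urllib.request


def homepage_responds(url, timeout=10):
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, connection failures, and timeouts.
        return False


def select_server(probe_history, primary, secondary, max_failures=3):
    """Choose the secondary only when the last max_failures probes all failed.

    probe_history is a list of booleans, oldest first, as returned by
    repeated calls to homepage_responds().
    """
    recent = probe_history[-max_failures:]
    if len(recent) == max_failures and not any(recent):
        return secondary
    return primary
```

For example, `select_server([True, False, False, False], "primary.example.org", "secondary.example.org")` selects the secondary host, while a single successful probe among the last three keeps traffic on the primary. Requiring several consecutive failures is one common way to avoid switching on a brief network glitch.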

DataONE
DataONE will run a process to monitor the status of the DataONE interface to Dryad.

Web monitoring

 * Mod_qos -- information about Apache usage.
 * Java Melody -- information about Tomcat's internal state.
 * Bandwidth usage -- constantly updated graph of bandwidth usage.