Monitoring and Testing

The Dryad production system is monitored by several services, each of which tests a different aspect of the system.

Nagios

Nagios runs on NESCent systems to verify that major Dryad functionality is in place. When it detects an error, it sends emails and text messages to the appropriate personnel.

Nagios performs these checks:

  • the machine is running
  • the Dryad home page responds
  • searches return results
  • the frequency of error messages in the Dryad log files
  • the number of processes on the host is below a user-defined threshold
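
As a rough illustration of how a threshold-style check such as the log-file or process-count check can be structured, here is a minimal Python sketch. It is not the plugin Dryad actually uses. The function names, the "ERROR" marker, and the warning/critical thresholds are all assumptions made for this example. The only fixed part is the standard Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN), with optional performance data after a "|".

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measured value onto a Nagios status code."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def count_errors(lines, marker="ERROR"):
    """Count log lines containing the marker (the marker is an assumed log format)."""
    return sum(1 for line in lines if marker in line)


def check_log(lines, warn=5, crit=20):
    """Return (exit_code, status_line) for a batch of log lines.

    The default thresholds are placeholders, not Dryad's real settings.
    """
    errors = count_errors(lines)
    code = classify(errors, warn, crit)
    return code, f"{LABELS[code]} - {errors} error lines | errors={errors}"


# Example: two error lines against a warning threshold of 2.
sample = ["INFO request served", "ERROR solr timeout", "ERROR db connection lost"]
code, message = check_log(sample, warn=2, crit=5)
print(code, message)
```

A real plugin would read the log file (or the process table), print the status line, and exit with the returned code so that Nagios can decide whether to notify anyone.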

Local configuration parameters for Nagios are in /etc/nagios/nrpe.cfg.

All Dryad-related Nagios checks (password protected)

Nagios also performs very high-level tests of the non-production systems.

Hyperic HQ

NCSU runs Hyperic HQ on the production server to monitor its internal status (memory, CPU usage, etc.).

The Hyperic system is at https://spectre.lib.ncsu.edu/ (password protected).

DNS failover

Dryad's DNS entries are managed by MCNC. When an issue is detected (i.e., when the home page doesn't respond), all traffic is rerouted to a secondary server. For details, see the Failover page.
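
The failover decision can be pictured with the following Python sketch. It is an illustration only. How MCNC actually probes the site and switches DNS is not described here, so the function names, the timeout, and the retry count are assumptions. The probe is passed in as a parameter so that the decision logic can be exercised without network access.

```python
import urllib.error
import urllib.request


def homepage_responds(url, timeout=10):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, connection failures, timeouts and malformed URLs.
        return False


def choose_target(primary, secondary, probe=homepage_responds, attempts=3):
    """Return the primary unless it fails every one of `attempts` probes.

    Retrying before failing over is an assumed policy, not a documented one.
    """
    for _ in range(attempts):
        if probe(primary):
            return primary
    return secondary
```

For example, choose_target(primary_url, secondary_url) returns the secondary address only after three consecutive failed probes of the primary, which avoids rerouting traffic because of a single dropped request.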

DataONE

DataONE will run a process to monitor the status of the DataONE interface to Dryad.

Web monitoring