DNS and Failover

From Dryad wiki

 Goals:  

  • Provide one-way replication to a read-only copy of the primary datadryad.org server
  • Make replication as close to real-time as possible
  • Make failover to the secondary server and failback to the primary server automatic


Current situation:

  • Dryad production at NCSU rsyncs everything in /opt to the failover system at Duke every minute, skipping a run if the previous rsync has not finished
  • The failover system runs Bucardo, which provides asynchronous replication of the dryad_repo database from the primary server at NCSU to the secondary server at Duke
  • Apache on the secondary server at Duke is configured to disallow logins and data submission.  Users would never see this FQDN, but the secondary site can be reached directly at http://dryad-dev.nescent.org
  • As member universities of MCNC, Duke and NCSU have free access to MCNC's Cisco GSS systems.  These systems are redundant and very reliable.  MCNC has configured them for DNS-based failover from the primary to the secondary datadryad.org system.
  1. Failover is based on HTTP HEAD requests.  If the web server returns a 200 status, the primary site is considered up.  If not, the GSS answers new DNS queries with the secondary server's address until the primary server responds again.  Once it does, DNS lookups point to the primary again.  http://www.cisco.com/en/US/docs/app_ntwk_services/data_center_app_services/gss4400series/v1.3/configuration/cli/gslb/guide/Intro.html#wp1119392
  2. I have verified that this works as expected using the datadryad.com domain 
  • All traffic between the two servers goes through IPsec, which encrypts all data transfer (and keeps it from being blocked by the Duke Intrusion Prevention System)
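
The HTTP HEAD rule in item 1 can be reproduced by hand to see roughly what the GSS sees. This is a minimal sketch using curl, not the GSS's own implementation; the function names, the 10-second timeout, and probing the root URL of datadryad.org are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of the failover rule: a site counts as up only on HTTP 200.
# Function names and the 10-second timeout are illustrative assumptions.

# Send a HEAD request and print only the numeric status code.
# curl prints 000 when no HTTP response was received at all.
head_status() {
    curl -s -o /dev/null -I --max-time 10 -w '%{http_code}' "$1"
}

# Succeed only for a 200 status.
is_up() {
    [ "$1" = "200" ]
}

# Print which site DNS should point at, given the primary's status code.
active_site() {
    if is_up "$1"; then
        echo primary
    else
        echo secondary
    fi
}

# Example (needs network access):
#   active_site "$(head_status http://datadryad.org/)"
```

Under this rule anything other than 200, including a redirect or a timeout, counts as down.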


Ideas for improvement:

  • If we don't want to depend on a third party like MCNC, or want more extensive "health" checks, we could set up a virtual machine (or two) at a cloud host such as EC2 and use it for failover.  This would allow for more extensive testing of the primary site in order to trigger a failover.  I have used HAProxy this way in the past (http://cbonte.github.com/haproxy-dconv/configuration-1.4.html#4-http-check expect) and it can trigger failover based on a string in the HTTP response, similar to our current Nagios health checks.  This would also be inexpensive ($50-$100/month), as the virtual machines could be very small, such as EC2 micro instances.  Large data transfers could go directly to the primary server rather than through the load balancer and thus would not count against any bandwidth quotas.
  • Rather than rsync, we could use something like glusterfs for real-time file replication.  This would require extensive testing and be much more complex, but it is a mature, widely used technology - http://www.gluster.org/community/documentation/index.php/Gluster_3.2:_Managing_GlusterFS_Geo-replication.  I have been using glusterfs on 8 old DSCR nodes we used for OpenSim.
  • If we want to stick with MCNC or another failover service that uses HTTP status for health checks, we could set up Nagios health checks of the production site that would shut down Apache and trigger a failover if a certain string is not on the website.
  • Make the failover site read/write.  If we control the failover process, we could make the secondary server read/write.  Before switching back to the primary, we could sync files and the database back from the secondary to the primary.  This would involve some downtime and more complication, but it is doable.
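
The string-based health check mentioned above (fail over when an expected string is missing from the page) might look roughly like the sketch below. The marker text, the function name, and the `service httpd stop` action are assumptions for illustration. They are not the existing Nagios configuration.

```shell
#!/bin/sh
# Sketch of a content-based health check. The marker string and the
# function name are hypothetical.

# Succeed if the page body read from stdin contains the marker string.
# grep -F matches the marker literally; -q suppresses output.
body_has_marker() {
    grep -qF -- "$1"
}

# Example (needs network access; stopping Apache makes the HTTP status
# check fail, which in turn triggers the DNS failover):
#   curl -s --max-time 10 http://datadryad.org/ \
#       | body_has_marker 'Dryad Digital' || service httpd stop
```

A check like this should probably require several consecutive failures before stopping Apache, so that one slow response does not cause a failover.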


Configuration information:

  • If the IP addresses used for the failover change, contact MCNC (https://www.mcnc.org/support/contact.html).  They are a 24x7 operation, so they should always be reachable.
  • To disable the failover, log in to Namecheap with the nescent account (the password is in KeePass, which Helpdesk and Sysadmin have access to) and change the DNS records for datadryad.org:
  1. datadryad.org to an A record for 152.1.24.169
  2. www.datadryad.org to an A record for 152.1.24.169
  3. delete the second NS record for www.datadryad.org
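
After changing the records, it may help to confirm that both names resolve only to the primary address. The sketch below assumes `dig` is installed; the function name is hypothetical. Cached answers can persist until the old records' TTL expires, so a mismatch right after the change is not necessarily an error.

```shell
#!/bin/sh
# Sketch: check that a name resolves only to the primary server.
PRIMARY_IP="152.1.24.169"

# Read resolved addresses (one per line) from stdin and succeed only if
# there is exactly one answer and it is the primary's address.
only_primary() {
    answers=$(cat)
    [ "$answers" = "$PRIMARY_IP" ]
}

# Example (needs network access):
#   for name in datadryad.org www.datadryad.org; do
#       if dig +short A "$name" | only_primary; then
#           echo "$name: pinned to primary"
#       else
#           echo "$name: NOT pinned to primary"
#       fi
#   done
```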


Misc.

These items could be synced on a weekly basis or when needed:

rsync -ahW --progress --stats --delete --exclude 'largeFilesToDeposit/' --exclude 'memcpu_dump/' /home/ dryad-dev.nescent.org:/home/
#after changes to production configuration
rsync -ahW --progress --stats --delete --exclude 'access.log*' --exclude 'tivoli/' --exclude 'log/' /opt/ dryad-dev.nescent.org:/opt/
rsync -ahW --progress --stats --delete /usr/local/apache-tomcat-6.0.33/ dryad-dev.nescent.org:/usr/local/apache-tomcat-6.0.33/
rsync -ahW --progress --stats --delete --exclude='logs/' --exclude='temp/' --exclude='newrelic/' /var/tomcat/ dryad-dev.nescent.org:/var/tomcat/
rsync -ahW --progress --stats --delete /usr/java/ dryad-dev.nescent.org:/usr/java/
rsync -ahW --progress --stats --delete /var/www/dryad/ dryad-dev.nescent.org:/var/www/dryad/
#only run during setup
#rsync -avhW --progress --stats --delete /etc/httpd/ dryad-dev.nescent.org:/etc/httpd/
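
The once-a-minute /opt sync described under "Current situation" skips a run while the previous one is still going. The production cron job is not shown on this page, so the following is only a sketch of one way to get that behaviour: it uses a lock directory, and the function name, lock path, and exit code 75 are assumptions.

```shell
#!/bin/sh
# Sketch: run a command only if no earlier run still holds the lock.
# mkdir either creates the directory or fails, so it works as an
# atomic lock.
run_exclusive() {
    lockdir=$1
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "previous run still active, skipping" >&2
        return 75   # arbitrary "try again later" code
    fi
    "$@"
    status=$?
    rmdir "$lockdir"
    return $status
}

# Example for a cron job that fires every minute (lock path is hypothetical):
#   run_exclusive /var/run/dryad-opt-sync.lock \
#       rsync -ahW --delete /opt/ dryad-dev.nescent.org:/opt/
```

If the script is killed mid-run, the lock directory is left behind and must be removed by hand. On Linux, `flock` from util-linux avoids that problem.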