Failover Server Configuration


Question: How do I configure a server to be a "Failover" server with both inbound and outbound calls including my agents and all my data?

So we need inbound calls re-routed to the new server, the new server allowed to begin making outbound calls (which we can actually set up in advance), and all agents re-pointed to the new server. Plus (of course) we'll need the configuration and database pulled over "as recent as possible".

How quickly MUST this new server go online in case of failure? And how much effort will you (or a minion) or a service provider (such as us or someone you call for the switchover) be expected to exert?

Note two basic concepts that should be obvious, but I like to spell them out at this point to be clear:

  1. FAST failover costs MORE. Slower costs less. Failover times can range anywhere from five minutes to an hour.
  2. Fully automated costs MORE. Manual switching costs less. Service Provider switchover costs based on the charges of that provider.

Also worthy of note: Many clients request that this be a "simple" process and appear to believe that "simple" somehow translates to "not very expensive". I generally refer these clients to my new car, which has a "simple" push-button start. You can rest assured that the push-button start was NOT inexpensive to add as a feature. Simplicity also costs money. :)

All that out of the way, here we go with some more concepts:

Physically, where are the two servers? Are they in the same building? City? Country? Room?

What telephone provider is being used for inbound and outbound calls? And what sort of connection is it?

Outbound:

Easy! We just add this server as "able to dial" during the build and outbound is "handled".

Inbound:

  1. Registration-based calls can be "re-routed" automatically by registering from the new server (based on some form of failover detection, or just a button on a web page).
  2. Non-registration-based calls can be set to "failover" at the carrier in some cases. This requires actively "killing" the failed server in case of a partial failure (i.e.: if the DB engine dies, calls will not automatically re-route, because the telephone engine would still be accepting connections; we'd need to shut down the telephone engine to trigger failover in this case). This does require configuration at the carrier, and not all carriers have this functionality.
  3. IP-based calls: If the carrier has no failover functionality, we have to re-direct the inbound calls through a carrier configuration page, by contacting the carrier directly, or by switching the Backup server's IP to the Primary server's IP so it can intercept the calls (see the watchdog sketch after this list). We'll need to discuss this after looking at your carrier's options and possibilities (and your choice for speed and ease above, of course).
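
To make the failover-detection and IP-takeover ideas above concrete, here is a minimal Python watchdog sketch that could run on the Backup server. Everything in it is an assumption for illustration: the addresses, the interface name, the check port, and the availability of the "ip" and "arping" commands on the Backup box. And remember the partial-failure warning in item 2: the Primary must actually be down (or be shut down) before the Backup grabs the IP, or both servers will answer.

    import socket
    import subprocess
    import time

    PRIMARY_IP = "203.0.113.10"    # placeholder: the Primary server's service IP
    SHARED_IP = "203.0.113.10/24"  # placeholder: same IP, to be bound on the Backup
    INTERFACE = "eth0"             # placeholder: the Backup server's network interface
    CHECK_PORT = 80                # checking 3306 instead would also catch a dead DB engine
    FAILURES_NEEDED = 3            # require several consecutive misses before acting

    def primary_alive():
        # True if the Primary still accepts a TCP connection on the check port.
        try:
            with socket.create_connection((PRIMARY_IP, CHECK_PORT), timeout=5):
                return True
        except OSError:
            return False

    def take_over_ip():
        # Bind the shared IP here, then send gratuitous ARP so the LAN
        # (and any on-site SIP gear) learns the new location quickly.
        subprocess.run(["ip", "addr", "add", SHARED_IP, "dev", INTERFACE], check=True)
        subprocess.run(["arping", "-U", "-c", "3", "-I", INTERFACE,
                        SHARED_IP.split("/")[0]], check=True)

    misses = 0
    while True:
        misses = 0 if primary_alive() else misses + 1
        if misses >= FAILURES_NEEDED:
            take_over_ip()
            break
        time.sleep(10)

The same detection loop could just as easily trigger a SIP re-registration (item 1) or a request to a carrier's failover mechanism (item 2) instead of an IP swap; the detection half is the part all three flavors share.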

Agents/Users:

  1. What URL do your users access the system through? The DNS name in that URL can be re-pointed in case of a failure, but the change generally takes anywhere from 15 minutes to an hour to reach the agents (depending upon your DNS TTL settings) unless you swap the IP between the Backup and Primary servers. (See the DNS sketch after this list.)
  2. What "server name" do your users' phones register to? This will need to be re-pointed to the new server as well.
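
For the DNS side of item 1, here is a hedged Python sketch using the dnspython library to re-point an A record with an RFC 2136 dynamic update. The zone, hostname, key, and server addresses are all placeholders, and plenty of setups just do this by hand in the provider's control panel instead; either way, agents keep resolving the OLD address until their cached TTL expires, which is where the 15-minutes-to-an-hour figure comes from.

    import dns.query
    import dns.tsigkeyring
    import dns.update

    # Placeholders throughout: your zone, your authoritative server, and a
    # TSIG key that is authorized to push dynamic updates.
    keyring = dns.tsigkeyring.from_text({"failover-key.": "BASE64KEYDATA=="})
    update = dns.update.Update("example.com", keyring=keyring)

    # Re-point the dialer hostname at the Backup server, and set a short TTL
    # now so the NEXT failover propagates faster than this one did.
    update.replace("dialer", 300, "A", "203.0.113.20")

    response = dns.query.tcp(update, "198.51.100.53", timeout=10)
    print(response.rcode())  # 0 (NOERROR) means the record was replaced

If the "server name" the phones register to (item 2) lives in the same zone, the same replace call handles it too.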

Data Migration:

  1. How recent must the data be? (less than 5 minutes of loss? An hour? From Sunday?) See the replication-lag sketch after this list.
  2. How quickly must the new server be "Live" and taking calls WITH data?
    Note that it is possible to begin taking calls immediately and restore the more recent data afterward, but merging the two data sets can be an issue. Some clients like this option: agents can tell callers "we'll call you back shortly, we are experiencing technical difficulties" while the restore runs.
  3. How much interaction will there be for this restore? Less interaction at "fail" is more expensive "now".
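
On the "how recent" question, the expensive end of the scale is live MySQL replication from Primary to Backup, and the cheap end is a nightly dump. If you do run replication, the one number that answers item 1 is the replica's lag. Below is a minimal Python sketch (using the pymysql package; the host, credentials, and database name are placeholders) that reads it. Note that newer MySQL releases spell the statement SHOW REPLICA STATUS.

    import pymysql

    # Placeholder connection details for the Backup server's MySQL replica.
    conn = pymysql.connect(host="203.0.113.20", user="monitor",
                           password="secret", database="asterisk")
    try:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute("SHOW SLAVE STATUS")  # SHOW REPLICA STATUS on newer MySQL
            status = cur.fetchone()
            if status is None:
                print("Replication is not configured on this server.")
            elif status["Seconds_Behind_Master"] is None:
                print("Replication is BROKEN; data loss reaches back to the last sync.")
            else:
                print(f"Replica is {status['Seconds_Behind_Master']} seconds behind; "
                      "that is your worst-case data loss right now.")
    finally:
        conn.close()

If that number stays near zero, a failover loses almost nothing; if replication is broken or you only do dumps, you are back to whatever the last sync was, which is exactly the trade-off in item 3.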

There is also the option of "just give me something that will save my @ss in case of a failure, even if we're down for an hour during the transition ... that beats a DEAD server with no prospect of bringing it back up any time soon". That version will still require up to an hour of work on our part to bring the new server "up to date and/or online" at the moment of failure, of course.