Designing Infrastructure High Availability

IT people, for some reason, seem to have an affinity towards designing solutions that use “cool” features, even when those features aren’t really necessary. This tendency sometimes leads to good solutions, but a lot of times it ends up creating solutions that fall short of requirements or leave IT infrastructure with significant short-comings in any number of areas. Other times, “cool” features result in over-designed, unnecessarily expensive infrastructure designs.

The “cool” factor is probably most obvious in the realm of High Availability design. And yes, I do realize that with the cloud becoming more common and prevalent in IT there is less need to understand the key architectural decisions needed when designing HA, but there are still plenty of companies that refuse to use the cloud, and for good reason. Cloud solutions are not meant to be one size fits all solutions. They are one size fits most solutions.

High Availability (Also called “HA”) is a complex subject with a lot of variables involved. The complexity is due to the fact that there are multiple levels of HA that can be implemented, from light touch failover to globally replicated, multi-redundant, always on solutions.

High Availability Defined

HA is, put simply, any solution that allows an IT resource (Files, applications, etc) to be accessible at all times, regardless of hardware failure. In an HA designed infrastructure, your files are always available even if the server that normally stores those files breaks for any reason.

HA has also become much more common and inexpensive in recent years, so more people are demanding it. A decade ago, any level of HA involved costs that exponentially exceeded a normal, single server solution. Today, HA is possible for as little as half the cost of a single server (Though, more often, the cost is essentially double the single server cost).

Because of the cost reduction, many companies have started demanding it, and because of the cool factor, a lot of those companies have been spending way too much. Part of why this happens is due to the history of HA in IT.

HA History Lesson

Prior to the development of Virtualization (the technology that allows multiple “Virtual” servers to run on a single physical server), HA was prohibitively expensive and required massive storage arrays, large numbers of servers, and a whole lot of configuration. Then, VMWare implemented a solution called “VMotion” that allowed a Virtual Server to be moved between server hardware immediately at the touch of a button (Called VM High Availability). This signaled a kind of renaissance in High Availability because it allowed servers to survive a hardware failure for a fraction of the cost normally associated with HA. There is a lot more involved in this shift that just VMotion (SANs, cheaper high-speed internet, and similar advancements played a big part), but the shift began about the time VMotion was introduced.

Once companies started realizing they could have servers that were always running, regardless of hardware failures, an unexpected market for high-availability solutions popped up, and software developers started developing better techniques for HA in their products. Why would they care? Because there are a lot of situations where a server solution can stop working properly that aren’t related to hardware failures, and VMotion was only capable of handling HA in the event of hardware failures.

VM HA vs Software HA

The most common mistake I see people making in their HA designs is accepting the assumption that VM-level High Availability is enough. It is most definitely not. Take Exchange server as an example. There are a number of problems that can occur in Exchange that will prevent users from accessing their email. Log drives fill up, forcing database dismount. IIS can fail to function, preventing users from accessing their mailbox. Databases can become corrupted, resulting in a complete shutdown of Exchange until the database can be repaired or restored from backup. VM HA does nothing to help when these situations come up.

This is where the Exchange Database Availability Group (DAG) comes in to play. A DAG involves constantly replication changes to Mailbox Databases to additional Exchange servers (as many of them as you want, but 2-3 is most common). With a DAG in place, any issue that would cause a database to dismount in a single Exchange server will instead result in a Failover, where the database dismounts on one server and mounts on the other server immediately (within a few seconds or less).

The DAG solution alone, however, doesn’t provide full HA for Exchange, because IIS failures will still cause problems, and if there is a hardware failure, you have to change DNS records to point them to the correct server. This is why a Load Balancer is a necessary part of true HA solutions.

Load Balancing

A Load Balancer is a network device that allows users to access two servers with a single IP address. Instead of having to choose which server you talk to, you just talk to the load balancer and it decides which server to direct you to automatically. The server that is chosen depends on a number of factors. Among those is, of course, how many people are already on each server, since the primary purpose of a load balancer is to balance the load between servers more or less equally.

More importantly, though, most load balancers are capable of performing health checks to make sure the servers are responding properly. If a server fails a health check for any reason (for instance, if one server’s not responding to HTTP requests), the load balancer will stop letting users talk to that server, effectively ensuring that whatever failure occurs on the first server doesn’t result in users being unable to access their data.

Costs vs. Benefits

Adding a load balancer to the mix, of course, increases the cost of a solution, but that cost is generally justified by the benefit such a solution provides. Unfortunately, many IT solutions fail to take this fact into account.

If an HA solution requires any kind of manual intervention to fix, the time required for notifying IT staff and getting the switch completed varies heavily, and can be anywhere from 5 minutes to several hours. From an availability perspective, even this small amount of time can have a huge impact, depending on how much money is assumed as “lost” because of a failure. Here comes some math (And not just the Trigonometry involved in this slight tangent).

Math!

The easiest way to determine whether a specific HA solution is worth implementing involves a few simple calculations. First, though, we have to make a couple assumptions, none of which are going to be completely accurate, but are meant to help determine whether an investment like HA is worth making (Managers and CEOs take note)

  1. A critical system that experiences downtime results in the company being completely unable to make money for the period of time that system is down.
  2. The amount of money lost during downtime is equal to whatever percentage of a year the system is down times the amount of annual revenue the organization expects to make in a year.

For instance, if a company’s revenue is $1,000,000 annually, they will make an average of $2 per minute (Rounded up from $1.90), so you can assume that 5 minutes of downtime costs that company about $10 in gross revenue. The cheapest of Load balancers cost about $2,000 and will last about 5 years, so you recoup the cost of the load balancer by saving yourself 200 minutes of downtime. That’s actually less than the amount of time most organizations spend updating a single server. With Software HA in place, updates don’t cause downtime if done properly, so the cost of a load balancer is covered in just being able to keep Exchange running during updates (This isn’t possible with just VM HA). But, of course, that doesn’t cover the cost of the second server (Exchange runs well on a low-end server, so $5000 for server and licenses is about what it would cost). Now imagine if the company makes $10,000,000 in revenue, or think about a company that has revenue of several billion dollars a year. HA becomes a necessity according to these calculations very quickly.

VM HA vs Software HA Cost/Benefit

Realistically, the cost difference between VM HA and Software HA is extremely low for most applications. Everything MS sells has HA capability baked in that can be done for very low costs, now that the Clustering features are included in Windows 2012 Standard. So the costs associated with implementing Software HA vs VM HA are almost always justifiable. Thus, VM HA is rarely the correct solution. And mixing the two is not a good idea. Why? Because it requires twice the storage and network traffic to accomplish, and provides absolutely no additional benefit, other than the fact that VM Replication is kinda cool. Software HA requires 2 copies of the Server to function, and each copy should use a separate server (Separate servers are required for VM HA as well, so only the OS licensing  is an increased cost) to protect against hardware failure of one VM host server.

Know When to Use VM HA

Please note, though, that I am not saying you should never use VM HA. I am saying you shouldn’t use VM HA if software HA is available. You just need to know when to use it and when not to. If software HA isn’t possible (There are plenty of solutions out there with no High Availability capabilities), VM HA is necessary and provides the highest level of high availability for those products. Otherwise, use the software’s own HA capabilities, and you’ll save yourself from lots of headaches.

Advertisements

Office 365, ADFS, and SQL

He’s an issue I’ve just run into that there doesn’t seem to be a good answer to on the Internet. When you are building a highly available ADFS farm to enable Single Sign On for Office 365, should you use the Windows Integrated Database (WID) that comes with Windows Server or store the ADFS Configuration on a SQL server? Put simply, if you are using ADFS *only* for Office 365, there is no need to use SQL server for storing the configuration database. Here’s why.

What Does ADFS with SQL Get You?

I spent a good day or so trying to answer this question because I came in after the initial configuration of a Hybrid Exchange migration to Office 365 that was set up using SQL to store ADFS. When we went to test failover, we ran into a *mess* of problems with this configuration. So I’m going to try to explain things in a way that makes sense, rather than just telling you what Technet says in their article on the subject.

Storing your ADFS configuration in a SQL database gives you 4 things:

  1. The ability to detect and block SAML and WS-Federation Token Replay attempts.
  2. The ability to use SAML artifact resolution
  3. The ability to use SQL fail-over to increase availability of the Configuration database
  4. Scalability!!!

Using WID with ADFS prevents these features from being available to you. Here’s what each of those lines of text mean (Why no, I’m not going to just give you that without definitions like Technet does. Can you tell I’m a little grumpy about Technet’s coverage of this subject?)

SAML and WS-Federation Token Replay Attempts

This is actually a potential security vulnerability that comes about because of how Token authentication mechanisms work. In some cases, an application that accepts a federated token identity for access may allow users to “replay” a previously issued authentication Token to bypass the authentication mechanisms used to issue that Token. You can do this by ending a session with a federated application, then returning to the application with a cached version of the interaction. The easiest way to do this is to just push the Back button in a web browser. If the federated application isn’t properly designed, it will accept the cached version of the token used in the previous session and continue operating just like someone never logged out.

It should be noted that this is a serious security issue. If you have an environment where there are systems exposed to public use by multiple users (Kiosks, for example), you will want to make sure that Token Replay attempts are dropped. But here’s the thing, Token Replay is a vulnerability that affects the application you are using federated login to access. It is not a vulnerability of ADFS itself. Basically, if the applications that you are using ADFS to federate with are designed properly, there is no need to have your ADFS environment provide this kind of protection. Office 365 was designed to prevent Token Replay attempts from succeeding. So if you are only federating with Office 365, you don’t need to have this functionality in your ADFS environment. So, no need to have a SQL based Configuration Database for this reason.

SAML Artifact Resolution

I have tried for a while to find a good definition for what SAML Artifact Resolution actually is. I’m still not 100% on it, because the technical documentation for the SAML 2.0 standard is about as informative as a block of graphite. However, it seems like this is essentially a method through which both sides of a SAML token exchange can check on and authenticate one another using a single session, rather than multiple sessions. This comes in handy for situations where load balancing is in use and the application requests verification of the token provider.

The good news is that you don’t have to understand SAML Artifact Resolution. Office 365 doesn’t use it, so you don’t have to worry about having it. No SQL required for this purpose.

SQL Failover Capabilities

At first glance, having SQL server failover capabilities for your ADFS configuration database sounds like a great idea. I mean, SQL server has great failover capabilities built in. The problem is, so does ADFS. When you use WID for your ADFS Configuration database, the first server in the Farm will store a read/write enabled copy of the configuration database. Each additional ADFS server added to the farm will pull a copy of this database and store it in a read only format. So in a WID based ADFS farm, every ADFS server in the farm has a copy of the configuration database by default. ADFS Proxy servers don’t store or need access to the configuration database, so only the backend ADFS servers will benefit at all from SQL integration.

But here’s the thing, for SQL integration to work properly in a way that allows full SQL failover capabilities, you need to have a valid SQL cluster. With SQL 2008R2 and earlier, this means that you have to have at minimum 2 SQL servers installed on a version of Windows that has Cluster Services available. Cluster services requires that you have shared storage before it will create a failover cluster for SQL versions prior to 2012. If you have 2012, you can manage much simpler failover, but if you’re limited to SQL 2008R2 and earlier, you’re going to end up with a significantly higher infrastructure requirement to get SQL failover working. It you can use SQL 2012 to store the ADFS database, there is some advantage to using SQL clustering for ADFS, but the advantage doesn’t actually justify the costs of doing so.

“What about SQL Mirroring?” You ask. Well, that’s a good idea in theory. You can have your ADFS config database mirrored on two SQL servers, but you will run into some major headaches using this technique, especially if you ever need to run a failover. First off, ADFS configuration with SQL server requires that you use the SFConfig command line tool to create the config database and add servers to the farm. This is a huge pain to do, but it’s necessary. Second, if you ever have to fail over to the other SQL server for any reason, you have to reconfigure every one of the servers in the farm. SQL mirroring doesn’t utilize a clustered VIP, so the configuration with SFConfig requires you to point the server to the active SQL server. When you activate the SQL database on the other server, things may or may not work without reconfiguration. If they do work, there’s a chance that they might *stop* working at any time in the future without notice until you run SFConfig on each member of the farm and use the now active mirror’s address in the command. Then you may have to log in to each of your Proxy servers and run the Proxy Config wizard to reinitialize the proxy trust between it and the ADFS farm. That means that if you want to fail over with SQL mirroring, you have to log in to 4 servers at minimum and reconfigure each one every time the active SQL mirror changes. This is a truly unwieldy and excessive failover process. It isn’t worth the loss of expense from not using a full SQL cluster to do this for ADFS’s configuration database.

The failover abilities of SQL allow you to have constantly available access to a writable copy of the ADFS configuration database. If you use WID and the first server in the farm goes down, you cannot make changes to your ADFS configuration. No adding new servers, no modifications to the ADFS functions, etc. The good news is that the only time you need access to a writable copy of the database is when you are adding new servers to the farm or making modifications to ADFS functions. If you’re using Office 365, you almost never have to make changes to either of these once its set up and running. ADFS will continue operating as long as even one server in the farm is accessible without a writable copy of the database. So realistically, you don’t need SQL failover capabilities for ADFS’s configuration database.

Scalability!!!

Scalability is one of my favorite buzz words. Scalability means “I can add more servers to handle this workload without any problems.” And SQL allows you to do that. However, WID integrated ADFS allows you to do so as well. The limitation, though, is that you can only have up to 5 servers in a WID based ADFS farm (NOTE: Proxy servers don’t count against this maximum number). “Only five servers!?” you may ask in heated exasperation. “Yes!” I answer. “Well, I can’t stand limitations! I will use SQL!” you respond. “You will never need more than 4 ADFS servers in ever!” I retort. ADFS is one of the most light-weight roles available for Windows server. I have deployed ADFS on numerous environments with user counts well past the 10,000 mark and the ADFS servers almost never measure a blip on the performance monitors. This is because ADFS does three things, it provides a website for users to log in to, it queries Active Directory to authenticate users, and it generates a small token that contains less than 1KB of data that it then sends to a remote service. This entire process requires about 10 seconds per user per session, and that includes about 9 seconds of someone typing in their credentials. So there isn’t really even much need to have load balancing capabilities on ADFS. One server can handle monstrous loads. If you have an environment with 100,000 users accessing numerous federated applications, you might need to add 1 or 2 extra servers to your farm to handle the load, but I doubt it. You’re better off just adding RAM and processing power to your ADFS servers to improve performance than adding extra servers to the farm.

If you want multi-site high availability, you may need 5 servers, but probably not more. Let’s say you want to have site resiliency in your ADFS configuration as well as failover capability in your primary and secondary site. This configuration requires a minimum of 4 ADFS servers. Two for the primary site, two for the failover site, and 4 proxy servers that don’t count against the maximum of 5 servers. There you have it, full site-resiliency and in-site failover capability using less than 5 servers. You can even add a third site to the farm without having any issues if you want using the remaining server. But let’s be honest here, if there is an emergency that takes down two geographically dispersed datacenters at the same time, you’re going to be looking for the nearest shotgun to fight off zombies/anarchist bandits, not checking to make sure your ADFS farm is still working, so triple hot-site resiliency for ADFS is probably more than anyone needs. So you don’t need SQL for ADFS if you want scalability. It’s scalable enough without SQL.

EXPLOSIONS! I mean Conclusions

If you are planning to build an ADFS farm for Single Sign On with Office 365, you will ask the question, “Do I need SQL?” The answer to that question, in pretty much every possible case is “Not if you’re just going to use it for Office 365.” SQL integrated ADFS configurations add some features, sure, but there is almost no situation where any of those features must exist for ADFS to work with Office 365. So don’t bother even trying it out. It adds an unnecessary level of complexity, cost, and management difficulties and give no advantages whatsoever.