Managed IT
So, You Think You Want SQL AlwaysOn Availability Groups?
When it comes to SQL Server high availability (HA) there are myriad options to choose from. Some have been around since the turn of the century, but there is a newer option, introduced in SQL 2012, that has been so hyped as to eclipse all the others: AlwaysOn Availability Groups (AGs). I recall the launch event and subsequent keynotes and sessions that touted it as even better than the greatest thing since sliced bread. But, as much as it does solve some erstwhile unsolved use case requirements, following the bread analogy, you will only get the sliced and toasty goodness if you have a master baker, Ginsu-certified slicers, and at least one beefeater on hand to test the bread at regular intervals. This may be absurd hyperbole, and is certainly an absurd analogy, but it’s worth considering the difficulty and expense of configuring SQL AGs.
Before I continue, here’s a very high-level overview of HA options and what AGs bring to the table (no further bread analogy nor pun intended):
Replication
Synchronizes (asynchronously and as fast as the network and I/O can keep up) specific tables and code objects (views, stored procedures, etc.) to one or more subscribing SQL servers. Replicas are readable and can be writable (and replicated back to the source). Indexes, permissions, etc. can be different at replica. Flexible, but not ideal for HA specifically because there can be differences between the copies.
Log Shipping
Synchronizes entire database at scheduled intervals to one or more subscribing SQL servers. Replicas are readable, but no writes nor schema changes are possible and readers will be interrupted as logs of changes are applied.
Mirroring
Deprecated (will be removed in the future, so don’t count on it for long term). Synchronizes entire database either asynchronously or synchronously (the latter guaranteeing identical data up to the last transaction at a cost of potentially much slower transactions at the source). Replicas are not readable (!) and therefore serve only as warm standby for HA failover.
Failover Clustering (now called AlwaysOn Failover Clustering because… marketing)
Requires (expensive) shared storage such as a SAN and Windows Server Failover Clustering (WSFC). Failover is at the SQL instance level (which contains one or more databases). WSFC makes it possible for failover without additional configuration of SQL clients. In other words, where all the prior HA approaches require clients or some middle tier to perform the switch from the primary to a secondary instance, WSFC presents a single portable endpoint that fails (moves) from one server to the other. However, since there is only one copy of the data, there are no offloading capabilities such as readable replicas.
Enter AlwaysOn Availability Groups. Like failover clustering, these depend on WSFC, and so have the portable endpoint advantage described above. However, they do not require shared storage. Also, though they do not operate at the SQL instance level, an AG can operate on a group of databases such that if one fails, they all fail over to a replica. In addition, though the databases are essentially mirrored either asynchronously or synchronously, just like with Mirroring, replicas can be readable! As you can see, AGs offer some great advantages over the other options. Imagine, for instance, that now you can have really fast (and cheaper than SAN) local SSD storage using RAID 10, and know that it is safely being replicated in near-real time to a warm standby in the next rack, and another in New Zealand. And further, that those warm standbys are available for read operations, allowing you to offload and geo-distribute reporting and backups. Cool, right? So, let’s do it!
Not so fast. It turns out that setting up and maintaining AGs is not only complicated, but can and has resulted in significant downtime in the real world. It seems that for all its goodness, geo-distributed WSFCs and the new SQL AG plumbing is proving to be quite fragile. For instance, if the primary server even momentarily cannot communicate with a secondary, WSFC takes the primary databases offline by design. So, your network had better be rock solid and have redundancies. Take that problem to the Internet for a WAN scenario, and you can imagine multiple unplanned outages. There are workarounds, but they get complicated and require more servers and licensing. Worse still, patches have proven to be unstable and unreliable so far (Microsoft acknowledges just some of the issues). At this point, the gurus recommend odious testing and staging of patches prior to application to critical environments (and why use AGs if not for a critical application). And a final kicker: unlike all the other options, most of the AG goodness (except basic rebranded mirroring) requires SQL Enterprise licensing.
Unless the specific use cases AGs provide are critical and the cost to setup and maintain AGs is affordable for the business, the price and risk is just not worth it. Without adequate resources on hand, the very HA we’re trying to achieve is undermined by the potential for unplanned downtime inherent in the implementation. While I would especially love to have readable mirrors in many business scenarios, we’ll have to make do with replication and other HA options until AGs become more reliable at less of a cost.