Disaster Recovery
3 Dimensions of Business Continuity for Data-Centric Processes
Protecting data, whether in a database system such as Microsoft SQL Server or in a file system, is becoming an ever more critical and complex endeavor. From high-profile data breaches to the all-too-common experience of losing a hard drive, we are all aware that data can be suddenly and catastrophically lost. But many people, even those with information technology responsibilities, lack the time, expertise, and budget to adequately protect their most critical data.
That may sound overstated, but consider the following small list of questions from a much larger potential list:
- Do you have a firewall between your data and the internet?
- Do you take regular backups of your drives and/or databases?
- Do you have a written and rehearsed plan to restore continuity should disaster strike?
Great start, now…
- Is your firewall regularly patched to fix constantly discovered vulnerabilities?
- Do you or an outside service perform regular penetration testing to ensure your firewall is secure?
- Do any of your employees have access to take, modify, or delete critical data?
- Would you know if they did?
- If you lose data this afternoon, can you restore it to nearly the same state it was in, or will you lose the day’s changes?
- In case of a flood or fire, how many days or weeks of changes would be lost if you had to use offsite backups to recover?
- How long would it take to get your business running if you lose a critical server, database, site, or internet access?
- Is your offsite contingency process automated, or does someone have to get in a car at 3 AM?
And this is just a start. It’s a lot to think about. It’s stressful. It takes time and budget. In truth, you can never be 100% protected. The best you can do is allocate the resources you have to reduce the risk as much as you can. A helpful way to organize your priorities and allocate resources is to assess your exposure along the following 3 dimensions.
Risk Avoidance
Ideally, disaster won’t strike. Systems never go down, files never become corrupt, breaches never succeed, and humans never err. This ideal cannot be fully attained, of course, which is why there are 2 other dimensions, but it is the first line of defense to strive for within the constraints of your resources. Basic security measures come to mind first here. A quality firewall is critical, and it must be monitored and maintained to ensure that it is functioning as well as it can. Along with that, a clear, sound, and consistently applied policy regarding external access must be in place and regularly reviewed and audited. In addition, servers and software should, at a minimum, run a vendor-supported version and be patched with the latest security and stability updates. Finally, and perhaps most challenging to solve, there is human error. Policies, processes, training, and quality assurance practices are tools to help avoid all-too-common human fallibility.
High Availability (HA)
The central question in this category is “how much downtime can your business/process tolerate?” Zero would be great, but, yet again, 100% uptime is not realistic. A good way to think about budget for this is to find the cost of being down at the most critical part of the day and use that as a guide, factoring in probability and risk appetite, for how much you choose to allocate to ensuring that systems are back online as quickly as possible. If being down for more than a day will cost only embarrassment and a few thousand dollars, then skip to DR measures without much concern for HA. If it will cost $100k or more, or a substantial portion of your customer relationships, this dimension of business continuity is essential. HA measures range from redundant power, servers, network components, internet providers, etc. to continuous offsite replication of data and maintenance of software in the cloud or another geographically isolated location.
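The budgeting guide above can be sketched as simple arithmetic. This is only a rough illustration, not a costing method: the function names, the parameters, and every figure in the usage note are hypothetical, and the risk_weight multiplier is just one assumed way to express risk appetite.

```python
def expected_annual_downtime_cost(cost_per_hour, hours_per_outage, outages_per_year):
    """Expected yearly loss: hourly cost x outage length x expected outages per year."""
    return cost_per_hour * hours_per_outage * outages_per_year


def justifiable_ha_spend(cost_per_hour, hours_without_ha, hours_with_ha,
                         outages_per_year, risk_weight=1.0):
    """Yearly loss avoided by shortening outages, scaled by risk appetite.

    risk_weight is a hypothetical knob: above 1.0 for a risk-averse business,
    below 1.0 for one willing to accept more exposure.
    """
    before = expected_annual_downtime_cost(cost_per_hour, hours_without_ha, outages_per_year)
    after = expected_annual_downtime_cost(cost_per_hour, hours_with_ha, outages_per_year)
    return (before - after) * risk_weight
```

For example, with an assumed cost of $1,000 per hour at peak, one outage expected every two years, and HA measures that cut an outage from 8 hours to 1, justifiable_ha_spend(1000, 8, 1, 0.5) suggests roughly $3,500 per year as a ceiling. A business losing $100k per outage would land on a very different number.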
Disaster Recovery (DR)
And finally, when disaster strikes, how much work and data will be lost? Again, zero sounds nice, but is unrealistic. The good news here is that some level of DR is achievable at less cost and complexity than comprehensive risk avoidance and HA. Most businesses these days take backups of their data at the drive and/or database level, and most store those backups offsite or in the cloud. If you’re not sure this is true for you, I strongly recommend implementing this basic level of recoverability as soon as possible. This is the last and most essential line of defense in your ability to resume your business’s data-centric processes should data loss occur. Beyond the basics, consider modeling what would happen if a drive crashed during peak business hours. Do you have backups leading right up to the event, or close enough? If you have near-real-time offsite replication as part of your HA capability, your data may survive a crashed drive, server, or even data center disaster. However, what about a malicious or accidental deletion, change, or corruption event that replicates to your alternate site in near real time? Replication faithfully copies the damage, so only a backup taken before the event can restore the data. After you’ve modeled and implemented a DR plan, make sure to test it. Plan outages or failovers and try unplugging devices to make sure your procedures and people are ready for the real thing.
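One way to model how much work would be lost is to compare the moment of the incident with the most recent backup taken before it. The sketch below is a minimal illustration under assumed inputs; the function names and the backup schedule in the usage note are hypothetical, not taken from any particular backup product.

```python
from datetime import datetime, timedelta


def latest_usable_backup(backup_times, incident_time):
    """Return the most recent backup taken at or before the incident, or None."""
    candidates = [t for t in backup_times if t <= incident_time]
    return max(candidates) if candidates else None


def data_loss_window(backup_times, incident_time):
    """Return how much work is lost when restoring from the latest usable backup.

    For a corruption or deletion that went unnoticed, pass the time the damage
    began as incident_time, since any backup taken after that point already
    contains the damage.
    """
    last_good = latest_usable_backup(backup_times, incident_time)
    if last_good is None:
        return None  # no backup predates the incident, so nothing can be restored
    return incident_time - last_good
```

With nightly backups at 1:00 AM and a drive failure at 3:30 PM, data_loss_window returns 14 hours and 30 minutes of lost changes. Adding more frequent backups to backup_times shows directly how much that window shrinks.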