Save to My DOJO
Table of contents
Perhaps the only thing worse than having a disaster strike your datacenter is the stress of recovering your data and services as quickly as possible. Most businesses need to operate 24 hours a day and any service outage will upset customers and your business will lose money. According to a 2016 study by the Ponemon Institute, the average datacenter outage costs enterprises over $750,000 and lasts about 85 minutes, losing the businesses roughly $9,000 per minute. While your organization may be operating at a smaller scale, any service downtime or data loss is going to hurt your reputation and may even jeopardize your career. This blog is going to give you the best practices for how to recover your data from a backup and bring your services online as fast as possible.
Automation to Decrease your Recovery Time Objective (RTO)
Automation is key when it comes to decreasing your Recovery Time Objective (RTO) and minimizing your downtime. Any time you have a manual step in the process, it is going to create a bottleneck. If the outage is caused by a natural disaster, relying on human intervention is particularly risky as the datacenter may be inaccessible or remote connections may not be available. As you learn about the best practice of detection, alerting, recovery, startup, and verification, consider how you could implement each of these steps in a fully-automated fashion.
Detect Outages Faster
The first way to optimize your recovery speed is to detect the outage as quickly as possible. If you have an enterprise monitoring solution like System Center Operations Manager (SCOM), it will continually check the health of your application and its infrastructure, looking for errors or other problems. Even if you have developed an in-house application and do not have access to enterprise tools, you can use Windows Task Manager to set up tasks that automatically check for system health by scanning event logs, then trigger recovery actions. There are also many free monitoring tools such as Uptime Robot which alerts you anytime your website goes offline.
Initiate the Recovery Process
Once the administrators have been alerted, immediately begin the recovery process. Meanwhile, you should run a secondary health check on the system to make sure that you did not receive a false alert. This is a great background task to continually run during the recovery process to make sure that something like a cluster failover or transient network failure does not force your system into restarting if it is actually healthy. If the outage was indeed a false positive, then have a task prepared which will terminate the recovery process so that it does not interfere with the now-healthy system.
Select the Optimal Backup
If you restore your service and determine that there was data loss, then you will need to make a decision whether to accept that loss or if you should attempt to recover from the last good backup, which can cause further downtime during the restoration. Make sure you can automatically determine whether you need to restore a full backup, or whether a differencing backup is sufficient to give you a faster recovery time. By comparing the timestamp of the outage to the timestamp on your backup(s), you can determine which option will minimize the impact on your business. This can be done with a simple PowerShell script, but make sure that you know how to get this information from your backup provider and pass it into your script.
Prioritize Backup Network Traffic
Once you have identified the best backup, you then need to copy it to your production system as fast as possible. A lot of organizations will deprioritize their backup network since they are only used a few times a day or week. This may be acceptable during the backup process, but these networks need to be optimized during recovery. If you do need to restore a backup, consider running a script that will prioritize this traffic, such as by changing the quality of service (QoS) settings or disabling other traffic which uses that same network.
Provision Fast Disks for Recovery
Next, consider the storage media which the backup is copied before the restoration happens. Try to use your fastest SSD disks to maximize the speed in which the backup is restored. If you decided to backup your data on a tape drive, you will likely have high copy speeds during restoration. However, tape drives usually require manual intervention to find and mount that drive, which should generally be avoided if you want a fully automated process. You can learn more about the tradeoffs of using tape drives and other media here.
Restart Services and Applications
Once your backup has been restored, then you need to restart the services and applications. If you are restoring to a virtual machine (VM), then you can optimize its startup time by maximizing the memory which is allocated to it during startup and operations. You can also configure VM prioritization to ensure that this critical VM starts first in case it is competing with other VMs to launch on a host which has recently crashed. Enable QoS on your virtual network adapters to ensure that traffic flows through to the guest operating system as quickly as possible, which will speed up the time to restore a backup within the VM, and also help clients reconnect faster. Whether you are running this application within a VM or on bare metal, you can also use Task Manager to enhance the priority of the important processes.
Verify that the Recovery Worked
Now verify that your backup was restored correctly and your application is functioning as expected by running some quick test cases. If you feel confident that those tests worked, then you can allow users to reconnect. If those tests fail, then work backward through the workflow to try to determine the bottleneck, or simply roll back to the next “good” backup and try the process again.
Regularly Test Backup and Recovery
Anytime you need to restore from a backup, it will be a frustrating experience, which is why testing throughout your application development lifecycle is critical. Any single point of failure can cause your backup or recovery to fail, which is why this needs to be part of your regular business operations. Once your systems have been restored, always make sure your IT department does a thorough investigation into what caused the outage, what worked well in the recovery, and what areas could be improved. Review the time each step took to complete and ask yourself whether any of these should be optimized. It is also a good best practice to write up a formal report which can be saved and referred to in the future, even if you have moved on to a different company.
The top software backup provides like Altaro can help you throughout the process by offering backup solutions for Hyper-V, Azure, O365 and PCs with the Altaro API interface which can be used for backup automation.
No matter how well you can prepare your datacenter, disasters can happen, so make sure that you have done all you can to try to recover your data – so that you can save your company!
Not a DOJO Member yet?
Join thousands of other IT pros and receive a weekly roundup email with the latest content & updates!