As an IT professional, you probably already recognize that backing up your services, operating system and data is a critical business task. But have you ever thought through the individual operational steps which must happen across all of your software and hardware for a backup to succeed? There might be dozens, perhaps hundreds, of separate actions which need to work together in the right order. A single point of failure could prevent an otherwise stable backup from completing. This blog post will walk you through best practices for building resiliency into your backup infrastructure to minimize your chances of data loss.
Redundancy for your Applications
The first layer of your infrastructure stack to evaluate is the applications, and how you back up the services which support your customers and key staff. If you are running applications on Windows Server or Windows, then make sure that you keep your Volume Shadow Copy Service (VSS) writers and requesters updated. Operating system patches can change components that VSS depends on, so the writers and requesters may need to be updated as well. It is good to verify that these remain current after every infrastructure patching cycle, especially for your important enterprise applications like SQL Server or Exchange Server. If you are running third-party or custom applications, make sure that you are also testing their respective backup requesters and writers regularly. If you are using virtual machines (VMs) and backing up the virtual hard disks (VHDs), make sure that your Hyper-V VSS writers are current too. Always keep your backup software, such as Altaro VM Backup, up to date.
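One way to make the post-patching check repeatable is to parse the output of the built-in `vssadmin list writers` command and flag any writer that is not healthy. The sketch below is a minimal example and not a definitive implementation: the function name and the sample text are illustrative, and it assumes the English-language output layout (`Writer name:`, `State:`, `Last error:`), which may differ on localized systems.

```python
import re

def unhealthy_vss_writers(vssadmin_output: str) -> list[dict]:
    """Return writers whose state is not Stable or whose last error is set.

    Assumes the English layout printed by `vssadmin list writers`.
    """
    writers = []
    current = None
    for raw_line in vssadmin_output.splitlines():
        line = raw_line.strip()
        match = re.match(r"Writer name: '(.*)'", line)
        if match:
            # A new writer section starts here.
            current = {"name": match.group(1), "state": None, "last_error": None}
            writers.append(current)
        elif current is not None and line.startswith("State:"):
            current["state"] = line.split(":", 1)[1].strip()
        elif current is not None and line.startswith("Last error:"):
            current["last_error"] = line.split(":", 1)[1].strip()
    return [
        w for w in writers
        if w["state"] is None
        or "Stable" not in w["state"]
        or w["last_error"] != "No error"
    ]

# Illustrative sample only, trimmed to the fields the parser reads.
SAMPLE = """
Writer name: 'System Writer'
   State: [1] Stable
   Last error: No error

Writer name: 'SqlServerWriter'
   State: [8] Failed
   Last error: Non-retryable error
"""

for writer in unhealthy_vss_writers(SAMPLE):
    print(f"{writer['name']}: {writer['state']} ({writer['last_error']})")
```

On a real server you would feed the function the text captured from an elevated `vssadmin list writers` run (for example through `subprocess.run`) instead of the sample string.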
Redundancy for your Virtual Machines (VMs)
Whether you are running your services inside VMs or directly on the host, it is best to use Windows Server Failover Clustering (WSFC) to maintain your service availability. Clustering is traditionally used on the physical hardware, but virtual machines can also be clustered, with the virtualized application moving between two VMs. This is known as “guest clustering”. If your application or VM crashes and is not automatically restarted, then your customers will not be able to use your services and you will not be able to take backups. Clustering will let you restart a service or application on the same host, and also move it to a different host if availability or reliability issues persist. This further helps with backups because if the backup network is unavailable on one cluster node, the service can be moved to a healthy node, or the backup traffic can even be rerouted through a different node if you are using Cluster Shared Volumes (CSV). If you take a dependency on Failover Clustering, then also make sure that you back up the cluster’s own internal database, ClusDB.
Redundancy for your Backup Networks
Next, you must build redundancy into your backup communication channels, whether you are using a native storage fabric (like Fibre Channel) or a network-based backup solution (like SMB). Each of your networks should be assigned a different primary role so that you are separating traffic from your customers, internal applications, cluster communications, and administrative tasks (backup, deployment and patching). You would not want a spike in customer network traffic to block your backup traffic, causing you to miss a backup. A missed backup means your newest recovery point is older than planned, so you risk exceeding your recovery point objective (RPO), and the same congestion can slow a restore enough to miss your recovery time objective (RTO). Consider scheduling your backups during quieter business hours, and if your networks support it, turn on Quality of Service (QoS). QoS allows you to prioritize the traffic types flowing through your network, so that you increase the likelihood that your backups are transmitted, even across busy networks. QoS is also a great value-added feature that a service provider can offer to their tenants.
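To make the effect of a missed backup on your RPO concrete, here is a small worked example. It is a sketch under stated assumptions: the function name, the 24-hour RPO and the timestamps are all hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta

def rpo_exposure(last_success: datetime, now: datetime, rpo: timedelta) -> timedelta:
    """Return how far the newest recovery point has drifted past the RPO.

    The result is zero while the last successful backup is still within the RPO.
    """
    age = now - last_success
    return max(age - rpo, timedelta(0))

# Hypothetical schedule: nightly backup at 01:00 with a 24-hour RPO.
rpo = timedelta(hours=24)
last_success = datetime(2024, 1, 1, 1, 0)

# The following night's backup was blocked by network congestion,
# so at 07:00 the next morning the newest recovery point is 30 hours old.
exposure = rpo_exposure(last_success, datetime(2024, 1, 2, 7, 0), rpo)
print(f"Past RPO by: {exposure}")  # Past RPO by: 6:00:00
```

A check like this can run after every backup window, so that a skipped job is noticed before the gap grows any further.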
Redundancy for your Storage
Providing resiliency in your storage is also a critical consideration. Enterprise storage vendors offer a multitude of built-in availability and resiliency features. Even if you are using commodity storage, Windows Server now provides many of these same features. All of your disks should use RAID for redundancy to allow for recovery in the event that a drive fails or a data block becomes unreadable. The backups should also be replicated to a secondary location, in the event that an entire disk array becomes corrupted or is physically damaged. Also, consider using offline storage such as tape backups for critical data which you are unlikely to access very often. Tapes are less convenient and slower to restore from, but because they are usually not attached to your network, they are more resilient to attacks like ransomware. The malware cannot reach physically disconnected tapes, even if your entire datacenter is compromised.
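If you are curious why RAID can recover from a lost drive, the toy example below shows the XOR parity idea used by parity-based levels such as RAID 5. It is a simplified illustration only: real arrays work on striped blocks and rotate parity across drives, and the function name and sample data here are made up for the demonstration.

```python
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("need at least one block, all of the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data "drives" plus one parity "drive".
data = [b"BACKUP01", b"BACKUP02", b"BACKUP03"]
parity = xor_blocks(data)

# Simulate losing the second drive: XOR the survivors with the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])  # True
```

Parity of this kind only covers the loss of a single drive in the set, which is one reason the backups should still be replicated to a secondary location.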
Redundancy for your Datacenter
If you have the resources, also replicate your data to a secondary site so that your data is protected in the event you lose an entire datacenter to a power outage, fire, hurricane or any other disaster. If you do not have a second datacenter, the major cloud providers now offer cloud-based backup and disaster recovery services. On Microsoft Azure, for example, you can protect virtual machines with Azure Site Recovery, or disks and files with Azure Backup. Plan for the same resiliency and availability when you copy your data between sites, such as redundant networks which use QoS to prioritize the traffic.
Other Best Practices for Added Resiliency
In addition to these best practices that harden your backup infrastructure, you also want to enforce other policies to ensure that the backup process is successful. Make sure your software, firmware and hardware are up to date with the latest patches. Configure administrative alerts which are triggered whenever a backup fails, when you are running low on disk space, or when a component within your backup infrastructure is not working. Regularly test your network speed to ensure that QoS is being enforced. Add centralized logging and auditing for each component so that every step in the process is tracked, and you can detect whether anything has been tampered with. And make sure that every component is regularly scanned by an up-to-date antivirus solution.
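As a rough illustration of the alerting idea, the sketch below gathers two of the conditions mentioned above: failed backup jobs and low free space on the backup target. Everything in it is an assumption made for the example, including the function names, the 15% free-space threshold and the job names. How the alerts are delivered (email, ticket, monitoring system) is left out.

```python
import shutil
from typing import Optional

def free_space_alert(total_bytes: int, free_bytes: int,
                     min_free_fraction: float = 0.15) -> Optional[str]:
    """Return an alert message if free space is below the threshold, else None."""
    if total_bytes <= 0:
        return "backup target reports no capacity"
    free_fraction = free_bytes / total_bytes
    if free_fraction < min_free_fraction:
        return f"backup target low on space: {free_fraction:.0%} free"
    return None

def collect_alerts(job_results: dict[str, bool], target_path: str) -> list[str]:
    """Combine failed-job alerts with a free-space check on the backup target."""
    alerts = [f"backup job failed: {name}"
              for name, succeeded in job_results.items() if not succeeded]
    usage = shutil.disk_usage(target_path)  # named tuple: total, used, free
    space_alert = free_space_alert(usage.total, usage.free)
    if space_alert:
        alerts.append(space_alert)
    return alerts

# Hypothetical job results; "." stands in for the backup target path.
for alert in collect_alerts({"sql-nightly": False, "file-share": True}, "."):
    print(alert)
```

Keeping the threshold logic in its own function makes it easy to test without touching a real disk.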
Resiliency in your Recovery Infrastructure
However, backup is not the only operation you need to plan for. You must also consider recovery. All of the aforementioned best practices must also be applied to the recovery process and the components it touches. When you are restoring from a backup, consider the priority of that data as it flows through your network or across your sites to minimize your recovery time. Most importantly, always make sure you test your backups and recovery, so that you can be the hero when a disaster strikes.
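One simple way to test a restore is to recover a set of files to a scratch location and compare checksums against the originals. The following is a minimal sketch of that idea, not a replacement for your backup product's own verification. The function names are hypothetical, and it assumes the source files have not changed since the backup was taken.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """Return the files that are missing or differ in the restored copy."""
    problems = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        elif sha256_of(src) != sha256_of(restored):
            problems.append(f"differs: {rel.as_posix()}")
    return problems
```

An empty list from `verify_restore(Path("data"), Path("restore-test"))` would mean every source file was found in the restored copy with identical contents. Both directory names are placeholders.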
Anything you think we missed in this article? Or something you would like explained further? Let us know in the comments below and I’ll get back to you 🙂