Organizations know the importance of backing up their data. It ensures a quick recovery when an outage, hack, breach or accident occurs. Nearly every IT department has implemented a disaster recovery (DR) strategy to combat data loss and downtime. Yet it is not uncommon for organizations to overlook the value of testing the DR plan. In fact, this year’s CloudEndure Disaster Recovery Survey notes that 15% of all organizations never test their DR plan. To help you be better prepared, we pulled together a list of best practices to guide your DR testing.
Why DR Testing is So Important
Instituting the DR plan might feel like the biggest – or only – step in the disaster recovery process. Far from it. Once you establish your DR plan, you should test it to determine its effectiveness:
- Does it meet expected objectives?
- Are there vulnerabilities in the system?
- Are you able to quickly get back online with minimal downtime?
- Does the recovery plan even work as intended, i.e. does it recover lost data?
You will never know if your strategy is effective until you test it. DR testing also matters when you look at the financials, because an outage can be very costly. Recent reports show an IT outage can cost upwards of $9,000 per minute. Many companies can’t recover from prolonged downtime and data loss, and end up going out of business.
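To put that per-minute figure in perspective, the arithmetic can be sketched in a few lines of Python. The $9,000 rate is the figure cited above; the function name and the sample outage durations are illustrative assumptions, and your own cost per minute may differ considerably.

```python
# Rough outage-cost estimate. The per-minute rate is the figure cited in
# this article; the outage durations below are hypothetical examples.
COST_PER_MINUTE_USD = 9_000


def estimated_outage_cost(minutes_down, cost_per_minute=COST_PER_MINUTE_USD):
    """Return the estimated cost (USD) of an outage lasting `minutes_down` minutes."""
    if minutes_down < 0:
        raise ValueError("minutes_down must be non-negative")
    return minutes_down * cost_per_minute


if __name__ == "__main__":
    for label, minutes in [("30 minutes", 30), ("4 hours", 240), ("1 day", 1440)]:
        print(f"{label}: ${estimated_outage_cost(minutes):,.0f}")
```

At that rate, a four-hour outage would come to roughly $2.16 million, which is why confirming that recovery actually works is worth the testing effort.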
Best Practices for DR Testing
As soon as you develop your DR strategy, it’s time to start testing. Run the first test immediately, so that quirks or failures in the system can be located and remedied before an actual disaster. After that, your DR testing schedule will depend on your organization. For instance, an organization that isn’t subject to many attacks or outages may find that a quarterly or yearly schedule is sufficient.
Others may find that to be insufficient. For instance, when Hurricane Sandy struck in 2012, many servers were affected, resulting in data centers being offline for days and the potential loss of a significant amount of data. Hurricane-prone areas are subject to many types of weather-related disasters. For businesses in these areas, testing the DR plan only once a year may leave them exposed to severe vulnerabilities.
Choose Your DR Test Method
There are different types of disaster recovery tests, encompassing some or all of your DR plan, and each method has its advantages and disadvantages. Let’s review the most common:
Walkthrough: Your DR team verbally goes through each step of the plan to identify gaps or potential issues. This method won’t cause disruption, but it also won’t tell you if the technology is going to work.
Simulation or Tabletop: This is a scenario-based approach that focuses on types of business disruptions or common disasters that are applicable to your business. It’s more in depth than a walkthrough, and also doesn’t affect your regular operations. You’ll need to include your vendors, if applicable, and key stakeholders that are involved in the actual physical testing of alternate sites.
Parallel: A parallel test allows you to bring the recovery site to a state of operational readiness, while maintaining normal operations at your primary site. This allows you to see if you can perform business transactions and support critical processes from the recovery site.
Full-Interruption: As the name implies, your production data and equipment are used in DR testing. This is the number one way to find weaknesses in your DR plan, but it comes at a cost: it’s time-consuming and can cause downtime for your operations.
Ultimately, tests that cover all aspects of your DR plan will offer better results when the time comes to initiate your DR plan. Be sure to document any observations made during testing, along with any issues that need remediation. Your test must also establish whether you were able to meet your DR goals.
Developing and Refining Goals
If you haven’t already, you’ll need to establish goals and objectives for your DR strategy. When testing, measure your results against pre-established recovery metrics, typically the two defined below. According to the “CloudEndure 2018 Disaster Recovery Report,” 21% of this year’s survey respondents reported RPOs of under one minute, 74% had RPO goals of 4 hours or less, and 69% had RTO goals of 4 hours or less. These figures may help you determine what’s acceptable for your own organization.
- Recovery point objective (RPO): RPO indicates the maximum age of the files that must be retrieved from backup storage to continue with normal operations. In other words, it specifies how far back in time, measured from the moment of failure, data recovery may reach, and therefore how much recent data you can afford to lose. It is expressed in seconds, minutes, hours, or days.
- Recovery time objective (RTO): This is the maximum amount of time a service or application can be unavailable after an outage. It is typically measured in seconds, minutes, or hours, and should be set based on how much the downtime will affect normal operations and how much revenue will be lost. Each organization will be different. For example, many IT services companies may not have the luxury of being offline for more than an hour.
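One way to set test results against these two metrics is to record a few timestamps during the test and compare the resulting intervals with your targets. The following is a minimal sketch in Python, not a prescribed method: the class name, field names, and sample timestamps are hypothetical, and the four-hour targets are chosen only to mirror the survey figures above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrTestResult:
    """Timestamps captured during a DR test (hypothetical structure)."""
    last_good_backup: datetime   # most recent recoverable data point
    outage_start: datetime       # when the (simulated) failure began
    service_restored: datetime   # when the service was usable again

    @property
    def achieved_rpo(self) -> timedelta:
        # Data written between the last backup and the failure is lost.
        return self.outage_start - self.last_good_backup

    @property
    def achieved_rto(self) -> timedelta:
        # How long the service was unavailable.
        return self.service_restored - self.outage_start


def meets_objectives(result: DrTestResult,
                     rpo_target: timedelta,
                     rto_target: timedelta) -> bool:
    """True only if both the RPO and the RTO target were met."""
    return (result.achieved_rpo <= rpo_target
            and result.achieved_rto <= rto_target)


# Made-up test run: backup at 08:00, failure at 09:30, restored at 14:00.
result = DrTestResult(
    last_good_backup=datetime(2018, 6, 1, 8, 0),
    outage_start=datetime(2018, 6, 1, 9, 30),
    service_restored=datetime(2018, 6, 1, 14, 0),
)
passed = meets_objectives(result, timedelta(hours=4), timedelta(hours=4))
print("RPO achieved:", result.achieved_rpo)
print("RTO achieved:", result.achieved_rto)
print("Objectives met:", passed)
```

In this made-up run the achieved RPO is an hour and a half, which is within target, but the service took four and a half hours to restore, so the test misses a four-hour RTO. That is the kind of gap a test report should flag for remediation.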
Once you determine the outcomes you need to achieve, it’s time to plan for the test. Testing requires budget and sometimes management approval. Get buy-in ahead of time, and work it into your yearly processes. Carefully choose your testing team, and ensure they are available the day the test is scheduled. The test run should be scheduled weeks in advance and everyone given prior notice since systems may be affected for hours.
It’s important to take notes during the test and develop a full report after it is completed. Review the results, and determine where the DR plan worked and where it failed. Update your strategy if necessary to make it more effective.
Avoid Downtime with a Disaster Recovery Plan
OnRamp helps you maintain continuous operations with our disaster recovery as a service. We will develop a DR plan based on your business objectives, and then deploy our solutions in our state-of-the-art data centers, which are sited in geographically stable locations. Learn how we can address your specific IT infrastructure requirements and scale as needed. Visit our site today to learn more.