Businesses and individuals are more dependent than ever on cloud services, and the companies that deliver these services are challenged to provide uninterrupted, 24/7 service to their customers. There’s no good time for downtime, especially in this digital era. In 2017, companies like IBM, Microsoft, Facebook, and Apple suffered downtime and intermittent service due to cloud outage events, with causes ranging from hackers to human error to botched updates. While IT leaders might not be able to plan for everything, there are steps you can take to mitigate the impact on your business and customers. Let’s dive in.
What Does Your Cloud Vendor Really Provide?
After the 2017 failure of Amazon’s S3 cloud service, many clients were left wondering if they would be compensated for downtime as part of their SLA. While eCommerce businesses, online financial services, and technology companies stand to lose revenues and clients from downtime, they are not alone. Every organization should understand that their cloud services are a core function of operations, and treat them as such.
Some cloud providers offer only basic infrastructure and expect you to navigate their service and create your own redundancy to protect your data in their cloud, including if (when) it suffers an outage. Although these providers may be less expensive, they put the burden of managing data availability on you. Most organizations’ IT staffs are stretched too thin to maintain the availability of yet another platform. Check your SLAs closely: an SLA should cover you if service availability falls below the guaranteed level, and it should ensure that your customer service requests are resolved in a timely manner.
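To make a guaranteed availability level concrete, it can help to convert the percentage into the downtime it actually permits. The following is a minimal sketch in Python; the function name and the 30-day billing period are assumptions chosen for illustration, not terms from any particular provider's SLA.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Return the downtime budget, in minutes, implied by an availability percentage.

    The 30-day default period is an assumption; check how your SLA defines it.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Compare a few commonly quoted availability levels.
    for level in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(level)
        print(f"{level}% availability allows about {minutes:.1f} minutes of downtime per 30 days")
```

Under these assumptions, a 99.9% guarantee still permits roughly 43 minutes of downtime in a 30-day period, which is worth knowing before an outage rather than after.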
Make sure you and your stakeholders are clear on the benefits of your service and SLA now, rather than after cloud outages happen. Your cloud service provider should be able to share their risk mitigation strategy and explain how the infrastructure is designed for uptime. Otherwise, you should consider a new provider.
Avoid Single Point of Failure
What would happen if your cloud provider went out of business? Or, what if your cloud provider continuously experiences outages due to circumstances such as resource exhaustion? Whether they host your production environment, or assist with data backup, your business should never depend on a single point of failure.
A multi-cloud approach and disaster recovery solutions are helping many companies minimize risk. With a multi-cloud approach, your IT department and service provider(s) can work together to designate a primary and a secondary site for your critical data. Amazon’s failure in February 2017 showed how many things can break when just one object storage zone has a problem.
Cloud-to-cloud backup means backing up your cloud data to another secure cloud environment. You’ll gain peace of mind from having a secondary copy of your critical data (and your customers’ data) that you can fail over to in an emergency. If there’s a storm coming, you can even preemptively fail over to your secondary site for the duration of the storm to avoid service interruptions. Look for cloud providers that have stable data centers across multiple power grids and are able to spread workloads across geographies. Hosting in multiple geographies can also boost your performance by distributing traffic to the region closest to the end user.
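The idea of failing over to a secondary site, while otherwise sending traffic to the closest region, can be sketched in a few lines of Python. This is only an illustration: the `Site` class, the `choose_site` function, the site names, and the latency figures are all hypothetical, and a real deployment would rely on your provider's health checks and traffic routing rather than hand-written logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Site:
    name: str          # hypothetical site label
    healthy: bool      # result of a health check (assumed to be supplied elsewhere)
    latency_ms: float  # measured round-trip time from the end user


def choose_site(sites: List[Site]) -> Optional[Site]:
    """Pick the healthy site with the lowest latency, or None if every site is down."""
    healthy_sites = [site for site in sites if site.healthy]
    if not healthy_sites:
        return None
    return min(healthy_sites, key=lambda site: site.latency_ms)


if __name__ == "__main__":
    sites = [
        Site("primary-east", healthy=False, latency_ms=20.0),   # simulated outage
        Site("secondary-west", healthy=True, latency_ms=65.0),
    ]
    selected = choose_site(sites)
    print(selected.name if selected else "no healthy site available")
```

With the primary marked unhealthy, traffic goes to the secondary site; when both are healthy, the lower-latency site is chosen.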
Make Cloud Security a Priority
Just as cloud services have matured, so have the hackers who want to exploit vulnerabilities in your infrastructure. Be prepared and secure your business assets with the right tools and processes. Many cloud providers claim to be compliant and secure, but not all of them have the certifications and accreditations to prove it (e.g., HITRUST CSF). Choose providers who take a multi-layered approach to both the physical and logical security of the environment where your data resides. Examples include firewalls, strict access controls, multi-factor authentication, and onsite personnel monitoring your physical assets.
Test For Failure
The worst time to find out you have an issue is when it results in a disruption to your business. While some cloud outages are caused by external malicious attacks, many have been the result of system updates, storage migration, and preventable human error. Failing to test for issues on a regular basis can cause a catastrophe. Testing for failure also includes testing the viability of your response plan, because responding quickly to an incident and containing the damage often makes a big difference.
One strategy that can help with your failure readiness is to regularly enact your failover plan. Testing your disaster recovery plan keeps it up to date with the latest changes to your environment and builds confidence that it will work when it’s needed. You can start with a tabletop exercise, where you verbally walk through a failover without swinging traffic between sites, before scaling up to moving production to another facility. There are also disaster recovery tools that can help you test your backup environment without impacting production; ask your provider about these options. For instance, OnRamp partners with Zerto to provide the latest replication and automation tools for disaster recovery.
Don’t Forget Your Communications Strategy
An often-overlooked or underdeveloped part of mitigating cloud outages is a communications plan for your internal and external stakeholders if an incident does occur. Poorly managed crisis communications can worsen the negative impact of a service interruption.
There should be a plan in place that has been tested and fits together with your other disaster recovery efforts. Unless your company identifies which events should initiate the response plan, however, you may lose valuable time shifting into action. This strategy should be reviewed annually or quarterly, depending on the needs of your company. Design a cloud services plan with clarity, redundancies, testing, and a step-by-step response strategy. This could save your business and your clients immeasurable costs in the months and years ahead. Learn more about what to include in your plan, how to test it, and how to refine it in our “How To: Crisis Communication During Disaster Recovery” article.
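One way to pin down which events should initiate the response plan is to write the triggers down as explicit thresholds. The sketch below shows the idea in Python; the trigger names, metric names, and threshold values are hypothetical placeholders, and the right values depend on your own SLA and business needs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trigger:
    name: str         # human-readable description of the event
    metric: str       # hypothetical metric key reported by your monitoring
    threshold: float  # hypothetical value; tune to your own environment


# Example triggers only; these numbers are assumptions, not recommendations.
TRIGGERS = [
    Trigger("elevated error rate", "error_rate_percent", 5.0),
    Trigger("extended downtime", "minutes_unavailable", 10.0),
]


def fired_triggers(metrics: Dict[str, float]) -> List[str]:
    """Return the names of triggers whose metric meets or exceeds its threshold."""
    return [
        trigger.name
        for trigger in TRIGGERS
        if metrics.get(trigger.metric, 0.0) >= trigger.threshold
    ]


if __name__ == "__main__":
    current = {"error_rate_percent": 7.5, "minutes_unavailable": 2.0}
    for name in fired_triggers(current):
        print(f"Start the response plan: {name}")
```

Writing the triggers down in one place, in whatever form suits your team, means nobody has to debate during an incident whether the plan should already be in motion.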
Take Action Today
As discussed, even a relatively small disruption can have major consequences. Yahoo’s disruption in 2013 affected only 1% of email accounts. But this small percentage amounted to 1,000,000 customers. The more diversification of systems and strategies your business employs, the more you can minimize the risk.
OnRamp protects our clients through our services that promise zero disruption from top threats. Contact our cloud experts today and find out how you can improve your cloud redundancies.