Data availability and disaster recovery: lessons for banks and fintechs
The devastating fire at OVHcloud's Strasbourg data centre in March serves as a critical reminder that disasters can and will happen anywhere, at any time.
Financial institutions, commercial entities and government agencies, among others, sustained downtime as a result, and the fire left millions of websites offline.
While some companies were able to restore their data and services from a separate location, others lost everything. Those organisations that had implemented disaster recovery (DR) best practices beforehand, by contrast, suffered no downtime or data loss.
Data protection and availability are vital for organisations in the financial services space, which hold vast volumes of mission-critical data and sensitive customer information.
As banks and fintechs embark on cloud transformation initiatives and modernise their corporate file services to boost employee productivity and reduce total cost of ownership, they should also prioritise DR and data availability in light of incidents such as the OVHcloud data centre fire.
Let’s take a look at the different ways to guarantee data availability and prepare for an outage or other data centre threats.
What are availability zones?
To ensure data availability, cloud providers organise their infrastructure into availability zones and regions.
An availability zone is essentially a large data centre, and each provider operates several availability zones within a small geographic area, which allows it to offer customers low-latency connections.
Regions, by contrast, are physically distant from one another, for example East Coast, West Coast and Europe, and each contains multiple availability zones.
This hierarchical structure means that end user organisations have a number of options as to how they choose to manage their data, with each approach providing a different level of data availability.
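To make that distinction concrete, here is an illustrative sketch in Python of how the placement of a primary and a backup copy maps to the kind of failure it can survive. The region and zone names are made up, not any provider's actual identifiers; the four approaches below then walk through these options in detail.

```python
# Illustrative sketch only: a toy model of how a provider's regions and
# availability zones (AZs) relate, and how a primary/backup placement maps
# to a level of resilience. Region and zone names are hypothetical.

REGIONS = {
    "eu-west": ["eu-west-az1", "eu-west-az2", "eu-west-az3"],
    "us-east": ["us-east-az1", "us-east-az2"],
}

def resilience(primary: str, backup: str) -> str:
    """Classify a primary/backup placement by the failure it survives."""
    def region_of(zone: str) -> str:
        return next(r for r, zones in REGIONS.items() if zone in zones)

    if primary == backup:
        return "none: both copies sit in one AZ (a single data centre)"
    if region_of(primary) == region_of(backup):
        return "survives an AZ failure, but not a regional disaster"
    return "survives the loss of an entire region"

print(resilience("eu-west-az1", "eu-west-az1"))
print(resilience("eu-west-az1", "eu-west-az2"))
print(resilience("eu-west-az1", "us-east-az1"))
```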
- Single availability zone. The organisations that experienced total data loss in the OVH fire had entrusted all their data to a single availability zone. As this disaster proved, any single data centre is at risk of failure, and putting both primary and backup data in the same availability zone is a major risk.
- Synchronous replication between two availability zones. In this scenario, the two zones sit in the same region but a few miles or more apart. Replicating data to both data centres makes it possible to fail over rapidly from one zone to the other should an outage occur, and avoids any data loss. With synchronous replication, every “write” operation is committed to both locations before it is acknowledged. The technique is only effective when latency is low and the two locations are close together; otherwise application performance will deteriorate.
- Synchronous replication between two availability zones and background replication to another region. This approach accounts for the possibility of a large-scale disaster, such as an earthquake or flood, affecting a whole region. In addition to keeping two synchronised copies of the data in one geographical region, asynchronous replication runs in the background to generate a third copy, which is stored in a separate region. In the event of a catastrophe, the organisation can fail over to another region and swiftly restore operations, with minimal downtime and at most a few seconds or minutes of lost data, depending on the replication lag (a simplified sketch of this write path follows the list below).
- Synchronous replication between two availability zones and background replication to a second cloud provider. The gold standard of data availability, this approach addresses the rare but business-critical situation where a cloud provider experiences a significant outage in more than one region. A cascading failure like this can occur due to software bugs, technical malfunctions or human error. A multi-cloud strategy is the only way to avoid such a catastrophe.
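As a rough illustration of how the last two options work, the following Python sketch shows a write path that commits each write synchronously to two in-region zone stores before acknowledging it, while a background worker copies it asynchronously to a third store in another region or at a second provider. The ZoneStore and ReplicatedStore classes are hypothetical stand-ins, not any particular provider's API.

```python
# A minimal sketch, not production code: writes are acknowledged only after
# both in-region copies succeed, while the cross-region copy is made
# asynchronously so its latency never affects the caller.

import queue
import threading

class ZoneStore:
    """Stand-in for a storage endpoint in one availability zone or region."""
    def __init__(self, name: str):
        self.name = name
        self.objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

class ReplicatedStore:
    def __init__(self, zone_a: ZoneStore, zone_b: ZoneStore, remote: ZoneStore):
        self.zone_a, self.zone_b, self.remote = zone_a, zone_b, remote
        self._backlog: queue.Queue = queue.Queue()
        threading.Thread(target=self._replicate_async, daemon=True).start()

    def write(self, key: str, data: bytes) -> None:
        # Synchronous step: both in-region copies must succeed before we return.
        self.zone_a.put(key, data)
        self.zone_b.put(key, data)
        # Asynchronous step: queue the cross-region (or cross-provider) copy.
        self._backlog.put((key, data))

    def _replicate_async(self) -> None:
        while True:
            key, data = self._backlog.get()
            self.remote.put(key, data)
            self._backlog.task_done()

store = ReplicatedStore(ZoneStore("az-1"), ZoneStore("az-2"), ZoneStore("other-region"))
store.write("ledger/tx-001", b"...")
store._backlog.join()  # for the demo only: wait for the background copy to finish
```

The gap between the synchronous and asynchronous steps is exactly the replication lag mentioned above: anything still sitting in the backlog when a whole region fails is the data an organisation stands to lose.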
What about the edge?
If a large-scale data centre built with the latest technology is at risk of failure, it goes without saying that a branch office server may also fail. This possibility must not be overlooked in a comprehensive DR strategy.
Edge locations have traditionally relied on backup solutions for DR. These require a restore operation to recover data after a disaster, which usually takes several hours or even days to complete, depending on the volume of data involved, and can therefore have a serious impact on business continuity.
An alternative approach is a global file system, where a master copy of the data is kept in the cloud while smart caching filers at the edge ensure data is available locally.
If there is a failure or outage at the edge, the system can fail over to the cloud as the DR site, enabling users and applications to remain online.
Once the edge filer is repaired, its metadata is downloaded first and the data is then restored in the background, allowing users to continue working with their files without having to wait for the whole dataset to be restored.
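The following minimal sketch, again in Python with hypothetical CloudStore and EdgeFiler classes rather than any vendor's actual API, illustrates that behaviour: reads are served from the local cache when the filer is healthy, fall back to the cloud master copy during an outage, and the cache is repopulated in the background once the filer is recovered.

```python
# A minimal sketch, assuming a global file system in which the cloud holds
# the master copy and a caching filer at the edge serves files locally.
# Class names and behaviour are illustrative assumptions only.

class CloudStore:
    """Master copy of every file, kept in the cloud."""
    def __init__(self):
        self.files: dict[str, bytes] = {"q1/report.xlsx": b"..."}

class EdgeFiler:
    """Caching filer at a branch office."""
    def __init__(self, cloud: CloudStore):
        self.cloud = cloud
        self.cache: dict[str, bytes] = {}
        self.online = True

    def read(self, path: str) -> bytes:
        if self.online and path in self.cache:
            return self.cache[path]       # served locally at LAN speed
        return self.cloud.files[path]     # failover: serve from the cloud copy

    def recover(self) -> None:
        self.online = True
        # The filer learns the namespace (metadata) immediately...
        known_paths = list(self.cloud.files)
        # ...then repopulates file contents in the background, so users
        # never have to wait for a full restore of the dataset.
        for path in known_paths:
            self.cache.setdefault(path, self.cloud.files[path])

cloud = CloudStore()
filer = EdgeFiler(cloud)
filer.online = False                   # outage at the branch office
print(filer.read("q1/report.xlsx"))    # users keep working via the cloud
filer.recover()                        # background restore after the fix
```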
The OVH data centre meltdown highlighted just how critical DR strategies are. Fintech and banking organisations should look to have a foolproof strategy in place, and at the very least avoid placing all their applications and data in a single location.
Companies that place all or some of their IT operations in the cloud must carefully consider the various DR and data availability options discussed above as they develop their business continuity strategies.
About the author
Aron Brand is CTO at CTERA Networks.
Prior to joining CTERA, he served as chief architect at SofaWare Technologies and developed software in the IDF’s elite technology unit, Unit 8200.