Going from three nines to four nines using Kafka
Date: September 15, 2021
Time: 01:00 PM - 02:00 PM

Many organizations have chosen a hybrid cloud architecture to get the best of both worlds: the scalability and ease of deployment of cloud, and the security, latency, and egress benefits of local storage. Data persistence on such an architecture can follow a write-back mode, where data is first written to local storage and then uploaded to cloud asynchronously. However, this means that applications cannot use the availability and durability guarantees of cloud: the availability of storage is bounded by the availability SLA of on-premise storage, which is almost always lower than the availability SLA of cloud.

By switching the order, i.e. uploading to cloud first and then hydrating on-premise storage, applications get the benefit of the cloud's availability SLA. In our case, this allowed us to move from three 9s of availability (99.9%) with local storage to four 9s (99.99%). Instead of uploading in write-back mode, we duplicated the incoming stream to upload to both cloud and on-premise storage. For on-premise uploads that failed, we leveraged Kafka's event processing to queue up the objects that needed to be egressed from cloud into local storage, which allowed us to hydrate local storage with objects already uploaded to cloud. Furthermore, since local storage space is limited, we periodically purged data from local storage and created a secondary copy of that data on cloud, again by leveraging Kafka event processing.
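
The cloud-first, dual-write flow described above can be illustrated with a minimal sketch. This is not the implementation presented in the talk: the `ObjectStore` class, the `upload` and `hydrate` functions, and the in-memory `deque` (standing in for a Kafka topic of hydration events) are all hypothetical names chosen for illustration, and both stores are simple in-memory dictionaries.

```python
from collections import deque
from dataclasses import dataclass, field


class StoreUnavailable(Exception):
    """Raised by a store that cannot accept a write right now."""


@dataclass
class ObjectStore:
    """Hypothetical in-memory stand-in for a cloud bucket or on-premise store."""
    name: str
    available: bool = True
    objects: dict = field(default_factory=dict)

    def put(self, key, data):
        if not self.available:
            raise StoreUnavailable(self.name)
        self.objects[key] = data

    def get(self, key):
        return self.objects[key]


def upload(key, data, cloud, local, hydration_queue):
    """Write to both stores; only the cloud write decides success."""
    cloud.put(key, data)  # a failure here fails the whole upload
    try:
        local.put(key, data)
    except StoreUnavailable:
        # Assumption: a real system would publish an event to a Kafka topic
        # here; the deque only models "remember this key for later".
        hydration_queue.append(key)


def hydrate(cloud, local, hydration_queue):
    """Consumer side: copy queued objects from cloud into local storage."""
    for _ in range(len(hydration_queue)):  # one pass over what is pending now
        key = hydration_queue.popleft()
        try:
            local.put(key, cloud.get(key))
        except StoreUnavailable:
            hydration_queue.append(key)  # still down, retry on a later pass


if __name__ == "__main__":
    cloud = ObjectStore("cloud")
    local = ObjectStore("on-prem", available=False)  # simulate a local outage
    pending = deque()

    upload("video-001", b"payload", cloud, local, pending)  # succeeds via cloud
    local.available = True                                  # outage ends
    hydrate(cloud, local, pending)                          # local catches up
    print(sorted(local.objects), len(pending))
```

In this sketch an upload succeeds whenever the cloud store accepts it, so a local outage delays hydration but does not fail the write, which is the property that lets the application inherit the cloud's availability SLA.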

Tejas Chopra
Senior Software Engineer, Netflix, Inc.