Business Continuity And Disaster Recovery
The master data processing services leverage the inherent capabilities of Kafka and its immutable log of events. Our general policy is to retain input data for 7 days in our private input topics, and to back this data up to Cloud Storage every hour.
Our public output topics are compacted by default, meaning they store only the latest state for a given key/ID. We consider these topics to be our source of truth. They are likewise backed up to Cloud Storage every hour.
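As a rough sketch of how these retention policies can be expressed, the snippet below creates one input topic with 7-day delete-based retention and one compacted output topic using the Kafka AdminClient. The topic names, partition counts, and replication factors are illustrative placeholders, not our actual configuration, and the hourly Cloud Storage backup runs as a separate process that is not shown here.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Private input topic (hypothetical name): delete-based retention, 7 days.
            NewTopic input = new NewTopic("items-input", 6, (short) 3)
                    .configs(Map.of(
                            "cleanup.policy", "delete",
                            "retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));

            // Public output topic (hypothetical name): compacted, so only the
            // latest state per key/ID is kept.
            NewTopic output = new NewTopic("items-output", 6, (short) 3)
                    .configs(Map.of("cleanup.policy", "compact"));

            admin.createTopics(List.of(input, output)).all().get();
        }
    }
}
```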
Kafka and its immutable log of events allow us to "jump back in time" and replay messages when needed. As long as the data we received was correct, we can reuse it for the entire period it is retained in the topic.
In effect, this gives us 7 days of real data that we can replay in case of an emergency.
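To illustrate what such a replay looks like in practice, the sketch below seeks a consumer back to a chosen timestamp inside the 7-day window and reads forward from there. The topic name items-input and the group id replay-tool are placeholder assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayFromTimestamp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-tool"); // hypothetical group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Replay everything from 3 days ago, i.e. well inside the 7-day retention window.
        long replayFrom = Instant.now().minus(Duration.ofDays(3)).toEpochMilli();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Resolve all partitions of the (hypothetical) input topic.
            List<TopicPartition> partitions = consumer.partitionsFor("items-input").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .toList();
            consumer.assign(partitions);

            // Translate the timestamp into the first offset at or after it, per partition.
            Map<TopicPartition, Long> query = new HashMap<>();
            partitions.forEach(tp -> query.put(tp, replayFrom));
            Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);

            // Seek to those offsets and consume forward; a real replay would loop
            // until it has caught up, rather than polling once.
            offsets.forEach((tp, ot) -> {
                if (ot != null) consumer.seek(tp, ot.offset());
            });
            consumer.poll(Duration.ofSeconds(1))
                    .forEach(r -> System.out.printf("replaying %s @ offset %d%n", r.key(), r.offset()));
        }
    }
}
```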
Our output topics keep the data forever in its latest updated state.
Business Continuity
Kafka and our input APIs can receive data even when the downstream flow is broken. External systems can therefore keep sending data to Hii Retail services even if the business logic in the master data or store data services is down. The same holds in the other direction: the business logic can keep processing data even if the input API is not responding. This means that changes to related data used in calculations, which require a new output, will still produce that output even while a particular entity is not receiving updates. When updates eventually resume, yet another output is produced reflecting the latest correct state.
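A minimal sketch of the producer side of this decoupling, assuming a hypothetical items-input topic: the input API appends to Kafka and succeeds whether or not any downstream processor is currently consuming. The durability settings shown are standard Kafka producer options, not necessarily our exact configuration.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class InputApiProducer {
    private final KafkaProducer<String, String> producer;

    public InputApiProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Durable writes: wait for all in-sync replicas and deduplicate retries.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        this.producer = new KafkaProducer<>(props);
    }

    /**
     * Accept an entity from the input API and append it to the input topic.
     * This succeeds as long as Kafka itself is available, independent of
     * whether the downstream business logic is running.
     */
    public void accept(String entityId, String payload) {
        producer.send(new ProducerRecord<>("items-input", entityId, payload));
    }
}
```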
We do not depend on any database to store our data, so no communication with additional services is needed. As a result, there are no additional services whose failure could cause an outage.
Disaster Recovery
If a bug is introduced into the stream of events and we produce incorrect output, we can fix the bug and reset the processing logic to a specific point in time to reprocess the data. This lets us repair existing data and also help downstream systems obtain a new, correct state.
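A minimal sketch of such a reset using the Kafka AdminClient, assuming a hypothetical input topic items-input and consumer group master-data-processor; the processing application must be stopped while its offsets are rewound. (If the processing logic is a Kafka Streams application, the bundled application reset tool would be the usual route; this sketch only shows the underlying mechanics.)

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ResetToPointInTime {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // A point in time known to precede the bug (illustrative value).
        long resetTo = Instant.parse("2024-05-01T00:00:00Z").toEpochMilli();

        try (AdminClient admin = AdminClient.create(props)) {
            // Look up every partition of the (hypothetical) input topic.
            TopicDescription description = admin.describeTopics(List.of("items-input"))
                    .allTopicNames().get().get("items-input");
            Map<TopicPartition, OffsetSpec> query = description.partitions().stream()
                    .collect(Collectors.toMap(
                            p -> new TopicPartition("items-input", p.partition()),
                            p -> OffsetSpec.forTimestamp(resetTo)));

            // Translate the timestamp into concrete offsets per partition.
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> offsets =
                    admin.listOffsets(query).all().get();

            // Rewind the (hypothetical) processing group; restarting the fixed
            // application afterwards reprocesses everything from this point.
            Map<TopicPartition, OffsetAndMetadata> newOffsets = new HashMap<>();
            offsets.forEach((tp, info) -> newOffsets.put(tp, new OffsetAndMetadata(info.offset())));
            admin.alterConsumerGroupOffsets("master-data-processor", newOffsets).all().get();
        }
    }
}
```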
The same resend capability lets us replay data that was already correct. If, for example, a downstream service has failed and is in an unrecoverable state, we can jump back in time to re-populate it.
Note that such a replay will potentially deliver data to other consumers as well, and may cause unintended load on other services.