Observability and SRE
Service Level Indicators (SLIs)
Availability of the service, to detect when the service is not functionally available.
The service must be available during the customer's working hours, that is, the 12-hour window from 09:00 to 21:00 CET.
This SLI can be measured in GCP for each cloud component. We will decide later which operations/flows should be part of the availability check (insert cash, calculate change, eject cash).
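The service window above can be made explicit with a small helper that decides whether a given instant counts toward the availability SLI. This is only a sketch: the function name `in_service_window` is hypothetical, the window is treated as half-open (09:00 inclusive, 21:00 exclusive), and CET is assumed to be a fixed UTC+1 offset, ignoring daylight saving time.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: CET is modelled as a fixed UTC+1 offset (no daylight saving).
CET = timezone(timedelta(hours=1), "CET")

WINDOW_START = time(9, 0)   # 09:00 CET, inclusive
WINDOW_END = time(21, 0)    # 21:00 CET, exclusive


def in_service_window(instant: datetime) -> bool:
    """Return True if a timezone-aware instant falls inside 09:00-21:00 CET."""
    if instant.tzinfo is None:
        raise ValueError("instant must be timezone-aware")
    local = instant.astimezone(CET).time()
    return WINDOW_START <= local < WINDOW_END
```

Availability probes or request samples taken outside the window could then be excluded before the SLI is computed.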
Latency for operations, to detect long-running operations.
The latency will not include time spent in the hardware device. We will measure only the internal processing time in the cloud services (and possibly in the edge services).
Error rate, to identify errors occurring while performing various cashchanger operations.
Service Level Objectives (SLOs)
The SLO for availability is to have the service available 99.8% of the time over one month.
The SLO for latency (number of fast enough operations / number of total operations) is to ensure that 99% of operations complete in less than 3 seconds.
The SLO for error rate (number of failed operations / number of total operations) is to keep the error rate below 0.01% over one month.
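To make the three objectives concrete, the sketch below evaluates the latency and error-rate ratios against their targets and converts the availability target into a monthly downtime budget. The class and function names are hypothetical, and the budget assumes a 30-day month in which every day has the full 12-hour service window; whether weekends count is not stated here.

```python
from dataclasses import dataclass

# Targets taken from the SLOs above.
AVAILABILITY_TARGET = 0.998   # 99.8% of service-window time
LATENCY_TARGET = 0.99         # 99% of operations under 3 seconds
ERROR_RATE_LIMIT = 0.0001     # 0.01% of operations


@dataclass(frozen=True)
class MonthlyCounts:
    total_ops: int    # all operations in the month
    fast_ops: int     # operations that completed in under 3 seconds
    failed_ops: int   # operations that ended in an error


def evaluate_slos(counts: MonthlyCounts) -> dict[str, bool]:
    """Compare the latency and error-rate SLIs against their SLO targets."""
    if counts.total_ops <= 0:
        raise ValueError("total_ops must be positive")
    latency_sli = counts.fast_ops / counts.total_ops
    error_rate_sli = counts.failed_ops / counts.total_ops
    return {
        "latency": latency_sli >= LATENCY_TARGET,
        "error_rate": error_rate_sli < ERROR_RATE_LIMIT,
    }


def downtime_budget_minutes(days: int = 30, hours_per_day: int = 12) -> float:
    """Allowed unavailability per month, in minutes of service-window time."""
    window_minutes = days * hours_per_day * 60
    return window_minutes * (1 - AVAILABILITY_TARGET)
```

Under these assumptions the service window is 21,600 minutes per month, so the 99.8% target leaves roughly 43 minutes of allowed unavailability. Similarly, with 200,000 operations in a month, the error-rate objective tolerates fewer than 20 failed operations.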
Observability
The cloud services will be monitored using Google Cloud Monitoring, where dashboards for latency, errors and availability will be created.
Monitoring and alerting
- We will monitor failed requests and internal errors, and try to differentiate between flow errors and problems caused by wrong configuration or incorrect software usage.
- All errors should be logged at the appropriate severity level, so that the logs are not cluttered.
- The load (CPU, memory, requests, responses) will be monitored using Google Cloud Monitoring capabilities for the cloud services.
- We will use Jira Service Management (JSM) to manage and respond to alerts more effectively.
- We follow the communication, operational, and release models described in the SRE guidelines.
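The first two points in the list above (separating flow errors from configuration and usage problems, and logging each at a suitable level) could be sketched as follows. The category names and the severity assigned to each category are assumptions made for illustration, not an agreed policy; only the standard Python `logging` module is used.

```python
import logging
from enum import Enum


class ErrorCategory(Enum):
    # Hypothetical categories derived from the monitoring bullet above.
    FLOW = "flow"                    # failure inside a cash changer flow
    CONFIGURATION = "configuration"  # caused by wrong configuration
    USAGE = "usage"                  # caused by incorrect software usage


# Assumption: example mapping only; the real levels are still to be agreed.
SEVERITY = {
    ErrorCategory.FLOW: logging.ERROR,
    ErrorCategory.CONFIGURATION: logging.WARNING,
    ErrorCategory.USAGE: logging.INFO,
}


def log_failure(logger: logging.Logger, category: ErrorCategory,
                operation: str, detail: str) -> int:
    """Log a failed operation at the severity mapped to its category."""
    level = SEVERITY[category]
    logger.log(level, "operation=%s category=%s detail=%s",
               operation, category.value, detail)
    return level
```

Keeping the mapping in one table means alert rules can filter on severity and category without each call site choosing a level on its own.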