Data correctness & service availability for ETL pipelines
How are we going to make sure our customers are getting what they paid for?
This post will be dealing with a holistic approach for dealing with SAAS availability. You should stop reading it if you are looking for deep dive into code / implementation.
After 2 years working on a data pipeline of a core business product, we are stepping into a set of technical problems that are tightly coupled with business aspects.
In other words how are we going to make sure we are delivering a good product continuously without breaking the SLA.
Making the bosses happy (or not)
Working for large organisations, creates a need for simple, understandable and context-less metrics, that management will be able to consume. We aspire to bring the people who we report to, the best picture of our client's experience using the product, for good and worse.
1. API performance stays (at least) the same as before
2. Data latency (pipeline end to end) stays (at least) the same as before
The premises puts a challenge, meaning we would like to avoid messing with the pipeline code, in order to measure it.
Blackboxing the pipeline - the over-optimistic approach
Simulate a customer end to end. Generate input data, consume it at the end of the pipeline and validate.
Why is it optimistic? let's be real, your ability of generating real world data, covering all possible problems (or at least a significant part), is low. Not mentioning data of different customers effecting each other, by scale or type.
The solution: Whitebox observing
I.E monitoring the system from inside, without changing the pipeline code, avoid adding latency both in API and data availability.
1. Observing the API
First we started with writing all ApplicationELB logs to S3, from there we used Athena in order to query the logs.
Average is not informative enough, My team lead demanded a simple metric showing whats the system status - green or red. The problem with this demand is that it addresses most of the cases, and not the long tail ones.
meaning that having a metric showing we have 300 ms. avg. response time, through all our endpoints, is not going to let us understand the bottlenecks or what our customers are experiencing. Even in the most redundant service, we could not say that 100% of our customers have the same experience of the product.
Generating a distribution of latency / errors per endpoint is a good enough starting point. We implemented a new docker based service, probing periodically our Athena tables and writing the API metrics to Librato dashboard.
2. Observing the data
Our pipeline is a standard ETL; files are landing in S3, normalized and aggregated, then uploaded to S3 after each step. Pipeline steps are communicating using SNS & SQS messages. This lets us subscribe multiple consumers for each topic.
We subscribe two consumers for each notification source:
1. for delivering data to our customers
2. for validating the data is ok and on time
We can start with a small POC - probing the (2) queue for data availability, and check how much time it takes for a file to finish the pipeline.
From that point we can add multiple validations before and after each component in our pipeline:
1. Are number of rows correlate before and after each component of the ETL?
2. Is input consistent through time?
3. Is number of output files are consistent through time?
4. Is files size consistent?
5. Compare between different data sources (i.e compare customers / tenants)
6. Generate data loss metrics
Many more validations / metrics can be done using this approach, even an alternative pipeline