> For example, an expensive worker that consumes some queue (think Midjourney prompts on an expensive GPU), the utilization on those machines should be high.
Hm, I'm not entirely sure this is a good example of an SLI; I don't think the end-customer cares about "GPU utilization"... :-P
No, they certainly don't care :) (unless it's too slow).
But this is an example of what the internal stakeholders (e.g. those who pay the bills) care about.
Actually, I don't understand the "2. lower bound". If the definition of an SLI is good/total, then a lower bound means that you want _few_ good events. That doesn't make sense to me. I have always worked with upper thresholds for SLIs since I want to be above a certain ratio of "good".
It's not that common. Let me expand on the example given and see if we can make it more approachable.
Let's say we have 50 nodes to process user load. We also have autoscaling in place to handle usage spikes.
After running the system for a couple of months, we notice that on average we have used 100 nodes where 50 should do the job. Were our autoscaling rules too aggressive? Was our software using the resources inefficiently? Did some inefficient algorithm go to production recently and increase our resource consumption?
The utilization metric can be defined as the amount of work done divided by the resources consumed.
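Plugging in the numbers above: if only 50 nodes' worth of work is being done while 100 nodes are consumed, utilization is 50 / 100 = 50%.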
By setting an SLI on utilization, we try to increase the amount of work done per resource, e.g. use fewer resources and max them out.
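To make the "lower bound" part concrete, here is a minimal sketch (the samples and the 60% threshold are made-up numbers, just for illustration). The lower bound applies to the utilization metric itself; the SLI is still the familiar good/total ratio, where a "good" event is one that clears the bound:

```python
# Minimal sketch of a lower-bound utilization SLI (hypothetical numbers).
# One sample per time window: utilization = work done / resources consumed.
utilization_samples = [0.48, 0.55, 0.72, 0.61, 0.44, 0.69, 0.80, 0.52]

# Lower bound on the metric: a window is "good" when utilization
# is at least the threshold. The SLI itself is still good/total.
LOWER_BOUND = 0.6

good = sum(1 for u in utilization_samples if u >= LOWER_BOUND)
total = len(utilization_samples)

sli = good / total  # 4/8 = 50% with the samples above
print(f"SLI = {good}/{total} = {sli:.0%}")
```

So a lower bound doesn't mean you want few good events; it means the metric has to stay above a floor for an event to count as good.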
This scenario is useful when the resource cost is high enough to justify spending time on performance optimization.
You can try the "Throughput: request rate" example in the Service Level Calculator (https://slo.alexewerlof.com/).