Averages are silly things when it comes to latency. I've yet to meet an application that actually has a use for, or a valid business or technical requirement for, an average latency. Short of the ones that are contractually required to produce and report on this silly number, of course. Contracts are good reasons to measure things.

So I often wonder why people measure averages, or require them at all. Let alone use averages as primary indicators of behavior, and as a primary target for tuning, planning, and monitoring.

My opinion is that this fallacy comes from a natural tendency towards "Latency wishful thinking". We really wish that latency behavior exhibited one of those nice bell curve shapes we learned about in some math class. And if it did, the center (mode) of that bell curve would probably be around where the average number is. And on top of that is where the tooth fairy lives.

Unfortunately, response times and latency distributions practically NEVER look like that. Not even in idle systems. System latency distributions are strongly multi-modal, with things we call "tails" or "outliers" or "high percentiles" that fall so many standard deviations away from the average that the Big Bang would happen 17 more times before such a result would be possible in one of those nice bell curve shapey thingies.

And when the distribution is nothing like a bell curve, the average tends to fall in surprising places.

I commonly see latency and response time data sets where the average is higher than the 99%'ile.

And I commonly see ones where the average is smaller than the median.

And everything in between.
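Both of these surprising placements are easy to reproduce. Here is a minimal Python sketch (the data sets are made up for illustration): one where a single huge outlier drags the average far above the 99th percentile, and one where a slightly-more-than-half majority of slow results puts the median above the average.

```python
import statistics

# A data set where the average is HIGHER than the 99th percentile:
# 999 fast results plus a single enormous outlier.
mostly_fast = [1] * 999 + [1_000_000]
mean = statistics.mean(mostly_fast)                 # ~1001
p99 = statistics.quantiles(mostly_fast, n=100)[98]  # 99th percentile cut point: 1
assert mean > p99

# A data set where the average is SMALLER than the median:
# just over half the results are slow, the rest are instant.
mostly_slow = [0] * 499 + [100] * 501
assert statistics.mean(mostly_slow) < statistics.median(mostly_slow)  # 50.1 < 100
```

Note that the outlier never even registers in the 99th percentile, yet it completely dominates the average.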

The average can (and will) fall pretty much anywhere in a wide range of allowed values. The math bounding how far the average can go is simple. All you need to do is construct the two most extreme data sets:

A) The biggest possible Average is equal to Max.

This one is trivial. If all results are the same, the average is equal to the Max. It's also clearly the highest value it could be. QED.

B) The smallest possible Average is equal to (Median / 2).

By definition, at least half of the data points are equal to or larger than the median. So the smallest possible contribution the results above the median can make towards the average would be if they were all exactly equal to the median. As for the other half, the smallest possible contribution the results below the median can make towards the average is 0, which would happen if they are all 0. So the smallest possible average occurs when half the results are 0 and the other half is exactly equal to the Median, leading to Average = (Median / 2). QED.
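Both extreme data sets above can be checked numerically. A minimal Python sketch (using `median_high`, which picks an actual data point and so matches the "at least half the points are at or above the median" definition used in the argument):

```python
import statistics

# Bound A: if every result is identical, the average equals the Max.
constant = [42] * 1000
assert statistics.mean(constant) == max(constant) == 42

# Bound B: half the results are 0, the other half exactly equal the median.
extreme = [0] * 500 + [10] * 500
median = statistics.median_high(extreme)        # 10
assert statistics.mean(extreme) == median / 2   # 5.0
```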

Bottom line: The average is a random number that falls somewhere between Median/2 and Max.

I can see one use case for measuring (and optimizing for) the average.

This is when you have only one customer, which is a batch job that always issues the next request just after the previous one was answered. For this customer, only the total time spent is relevant.

For fairness, we divide it by the number of requests. This is the average.

If only total time is relevant, then the only relevant metric is the total time.

You can also divide that total time spent by the constant 17. That would be just as "fair", and just as meaningful ;-)