Throughput vs. Latency and Lock-Free vs. Wait-Free (concurrencyfreaks.blogspot.com)
146 points by ingve on Aug 6, 2016 | hide | past | favorite | 31 comments


This is an exceptionally important concept when building web services. I had tacitly understood this, but the lead engineer at Blekko really helped me internalize what it meant. My first test question now to someone who asserts they understand the performance of their web-based product is: what is the 99th-percentile latency? Answers start with "huh?" (ok, this person hasn't thought too hard about the problem), "the average is ..." (this person measures latency but doesn't really understand that you can't infer very much useful information from the average), "the median is ..." (this person at least gets that latency can be all over the map, so they are trying to identify the midpoint), and "it's xx ms", which is someone who realizes that by starting at the 99th-percentile value they can reason about how the service is working in its entirety.


Maybe, but even this still feels incomplete. Missing from this (and ironically missing from the linked article, given that it's part of the blog's title) is the notion of concurrency. Whether an xx ms 99th-percentile latency is good or bad depends very strongly on how many parallel requests are in flight. In a single-threaded situation where a slow response holds up all other requests, a few slow responses might be a disaster, but if it just means that an individual user has to wait half a second longer, it might be a non-issue.

I agree with the blog author's point that characterizing the entire latency distribution is more valuable than looking at an individual statistic, but it's difficult for me to understand how this blog post can avoid talking about the equally important degree of parallelism. I'm probably a broken record, but I think Little's Law is one of the most under-utilized concepts in system analysis and design. http://web.mit.edu/sgraves/www/papers/Little's%20Law-Publish....
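Little's Law itself is just arithmetic: the average number of requests in flight equals throughput times mean latency. A minimal sketch with made-up numbers:

```python
# Little's Law: mean number in flight L = arrival rate (lambda) * mean time in system (W).
# These figures are hypothetical, purely for illustration.
throughput_rps = 200     # requests per second (lambda)
mean_latency_s = 0.050   # 50 ms mean time in system (W)

in_flight = throughput_rps * mean_latency_s
print(in_flight)  # 10.0 concurrent requests on average
```

Any two of the three quantities pin down the third, which is why quoting a latency percentile without the concurrency level tells only part of the story.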


Actually concurrency makes it worse. Say your site has a median latency of < 1 second but a p99 of 10 seconds for any particular request, but your site needs to make 100 requests (not unheard of with modern sites) for any particular page. This means that now almost 99% of your users will be affected by the p99 latency. This is usually a big issue because most systems are designed with median or mean latency in mind, so the p99 is usually orders of magnitude greater.


I agree that this should be kept in mind, but I think your specific conclusion requires a lot of assumptions. The primary one is that all those requests are "independent and identically distributed". If instead your server gets laggy one minute per hour when some background update runs and trashes the file cache, the percentage of affected users would be completely different.

I think the blog author is on the right track to suggest that you at least need to look at the full distribution of the latencies. I'd go farther and say that if the metric you care about is total load time for a page with 100 components, lumping together all the individual component requests is probably never going to be able to provide all the information you actually need.

But as 'sp527' pointed out below, I may not have been precise enough in my use of "concurrency". From the Little's Law perspective I was emphasizing, we really care about the "number of requests in flight". But what constitutes a request depends on what we are measuring. If each "page" consists of 100 pipelined HTTP requests, we probably need to define "in flight" as the number of simultaneous users we can support if we want to calculate the metric we actually care about.


Mathematically, 1-(1-0.99)^100 = ~1.


Not quite. What you calculated is the probability that visitors will not hit the p99 case on all their requests. In other words, the chance of getting a p99 response on all 100 requests is infinitesimal.

The chance of getting at least one p99 response out of a hundred is 1 - 0.99^100 ≈ 63%.
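Both quantities are easy to check numerically (a quick sketch, not specific to any real service):

```python
p = 0.99  # fraction of requests at or below the p99 threshold
n = 100   # component requests per page load, assumed independent

# Probability that ALL 100 requests beat the p99 threshold:
all_fast = p ** n

# Probability that at least one request lands in the slow 1% tail:
at_least_one_slow = 1 - all_fast
print(round(at_least_one_slow, 3))  # ≈ 0.634
```

The grandparent's 1-(1-0.99)^100 instead computes the chance of avoiding a page where *every* request is slow, which is why it comes out as ~1.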

https://latencytipoftheday.blogspot.com/2014/06/latencytipof...


This is only if it's Markovian - but it almost certainly is not.

If you need X requests served for a full page load (or whatever counts as the primary functionality of the page), you should be totaling those request times into a single entry for your histogram; otherwise you are calculating the distribution of something potentially useless (e.g. 50% of page_is_broken_without_it.js might actually be in p99 even if that file is 1% of your requests, and the generic histogram might look deceptively good).
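One way to do that totaling, sketched with a hypothetical request log (the file names and timings are made up, and components are assumed to load serially):

```python
from collections import defaultdict

# Hypothetical request log: (page_view_id, component, time_ms).
requests = [
    (1, "index.html", 40), (1, "page_is_broken_without_it.js", 900),
    (2, "index.html", 35), (2, "page_is_broken_without_it.js", 50),
]

# Sum component times into one histogram entry per full page view.
page_totals = defaultdict(float)
for view_id, _component, time_ms in requests:
    page_totals[view_id] += time_ms

print(sorted(page_totals.values()))  # [85.0, 940.0]
```

Page view 1 is slow even though only one of its requests was slow, which is exactly what a per-request histogram would hide.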


I don't quite follow. When the 99th-percentile latency is 100ms, that means 99% of requests were served within that timeframe. In terms of quality of service, isn't that all that matters?

I can see that throughput is an important measurement when you're trying to plan infrastructure, and concurrency is an important concept when you want to optimize it. But if I were to guarantee you <100ms latency for the next six months, you could drop your pen and take a vacation, right?


While 99% latency provides a good measure of QoS, it does not tell me the scale at which it is operating. I may have a service that renders some static content, say a hundred times a day. My 99th percentile latency is < 50ms. Now my observed throughput is literally non-existent. If 99% latency was < 100ms at 10k requests/s throughput, now we are talking!


Also missing is how timeouts can cap the latency distributions.


Yes! This one is important. Error rates are very important to track too


Parallelism and concurrency aren't the same thing. Normally I wouldn't nitpick something like that, but it seemed important to clarify that in the context of talking about very esoteric distributed system considerations.


In some cases the definitions are different, but aren't they usually used to describe the same concept at different levels of granularity? Which do you think is correct terminology in this context? And which is appropriate for the example in Little's paper of how long an average bottle remains in the wine cellar?

And why do you feel it's esoteric? I was trying to argue that knowing the relationship between latency, throughput, and concurrency/parallelism is essential for proper design in just about any situation where performance is a concern. Rather than esoteric, I see it as fundamental --- although I suppose that depends on what one is trying to do.


Sorry I didn't offer additional clarity. Parallelism generally implies a task that can be split into smaller chunks, which are then processed more or less simultaneously. Concurrency, by contrast, refers to a situation in which multiple executions of the same task can overlap in time. I think it's important to get that right when talking about dist. systems simply to avoid potential confusion.


"Which is someone who realizes that by starting at the 99th percentile value they can reason about what how the service is working in its entirety."

I'm still at "huh?" but things are clicking more. Would you elaborate on that statement?


I read it as an estimate of worst-case latency, which will be a sum of many different latencies at different layers of the service: application, network, database, disk, memory, synchronization, CPU contention, VM contention - the whole lot.


Doesn't this depend on how "tuned" your subcomponents are? If I had a system that's supposedly heavily optimized, I would simply start seeing the worst-case latency on a higher percentile, 99.9th and so forth. Gil Tene has an excellent talk[1] on this.

[1] https://youtu.be/lJ8ydIuPFeU


Indeed. Error rates, latency distribution, max latencies, percentiles (as a special case of the distribution) and throughput give a good picture of how a system is working.


And wrk2 (https://github.com/giltene/wrk2) is the way to measure it.


So most benchmarking tools don't do this? Any idea if `ab`[1] does? I've been assuming that they all fired off requests at constant rates, 'cuz yeah, otherwise it's pretty inaccurate compared to real-world load.

[1]: https://httpd.apache.org/docs/2.2/programs/ab.html


I just checked wrk and, as it turns out, they did indeed add it in the meantime. wrk2 was about the only one when I needed it a year or two ago. Most people here seem to be working with so much traffic that they primarily care about throughput, but I live off about 300 pageviews/day. With the way that people and Google react to latency, it's a good day when I can shave 10ms off every request.

(Can't wait for the ruby->elixir transition to be complete. So far it seems to save about 100ms).


`ab` doesn't. But then again, benchmarking http services for latency doesn't really work. It's too hard to reproduce very complex real-world environments to get any useful latency data out of it. Just record response times where possible instead.
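If you do record response times where possible, the percentiles fall out directly. A sketch using only Python's standard library, on synthetic data (the distribution and numbers are made up for illustration):

```python
import random
import statistics

random.seed(1)
# Hypothetical recorded response times (ms): exponential tail, ~20 ms mean.
samples = [random.expovariate(1 / 20) for _ in range(10_000)]

cuts = statistics.quantiles(samples, n=100)  # 99 cut points
p50, p99 = cuts[49], cuts[98]
print(f"p50 ≈ {p50:.1f} ms, p99 ≈ {p99:.1f} ms")
```

With real traffic you'd feed in actual measured response times instead of samples, but the quantile step is the same.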


wrk has included corrections for coordinated omission. It is a different implementation, not sure what all of the differences are.


   "the inverse of the mean of the latency
    distribution is [...] the throughput!
i.e. Little's Law?


No, that's not Little's Law. Little's Law describes the depth of a queue, not its throughput.

In the original article, this assertion is wrong except under special conditions that aren't described in the article. In computer systems you can have a latency of X and a throughput of Y, Z, or any number with no particular relation to X. Consider for example if I have a system with a throughput of 1000/s and a latency of 1ms, which satisfies the given relation but for no particular reason. Now suppose I add at the beginning of every request a 1-hour delay. Now my latency is 3600001ms, but my throughput is still 1000/s. The only thing that has changed is the number of requests in flight.
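Running that example through Little's Law (requests in flight = throughput × latency) shows where the change goes. Working in milliseconds to keep the arithmetic exact:

```python
throughput_per_ms = 1.0        # 1000 requests/s = 1 request per ms
latency_before_ms = 1          # original 1 ms latency
latency_after_ms = 3_600_001   # same request plus a 1-hour delay up front

# Little's Law: average requests in flight = throughput * latency
print(throughput_per_ms * latency_before_ms)  # 1.0 in flight
print(throughput_per_ms * latency_after_ms)   # 3600001.0 in flight
```

Throughput is unchanged; the extra hour of latency is absorbed entirely by the in-flight count, which is why latency and throughput need not bear any fixed relation to each other.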


If you assume that queue is bounded (which is true for real systems), then Little's Law does hold:

Number of requests in queue = latency * throughput

E.g., the left-hand side is a constant if you assume the system is fully utilized or constantly under max load.


I'm thinking about this problem in the context of a data ingestion API where the requests are processed asynchronously.

The latency from the client to the backend is not really an issue so long as the request is processed at some point. In this situation, I'm looking at the throughput and cpu usage to see how far an instance can scale.

Am I missing something, or does latency not really play a large role in this particular scenario?


Well, for your setup latency has no effect. The constraint here is how quickly you can process incoming requests. Throughput on the system would simply depend on how many resources you have to process those requests.


> there is typically a tradeoff when it comes to throughput vs latency

I don't think it's that simple. Unless the algorithm does something periodically that blows up tail latency, the one with better throughput might do better on tail latency as well, at least under the same load, and that matters more than its latency under the higher load that only it can handle.


I think it would be more accurate to say one can often break a problem into a deterministic one with commuting stages, producing a fork-join algorithm that would be optimal if each step could be guaranteed to synchronously take exactly the same amount of time at each step, all hardware resources were truly saturated, and there was no communication overhead between parallel workers.

For all of the above to be true, one would probably have to run the algorithm in serial on data that fit in L1, on a processor with no pipelining; in practice, a common strategy is to try to make the "horizontal" length of the stages larger (fitting problem instances into each stage, for example) to reduce the overall percentage of the time that's taken up by contention and communication overhead (or correlated problems like spurious aborts in the optimistic concurrency paradigm).

Obviously, there are other tradeoffs that can be made to improve both latency and throughput (especially if you're dealing with a streaming problem and can employ latency hiding techniques like prefetching that exploit underutilized hardware), but I hope you'll agree that that particular sort of tradeoff (which is extremely common) does manifest as a latency-for-throughput tradeoff; pipelining itself is a great example of that.

Another way to say the same thing is that many algorithms improve throughput by deprioritizing fairness; usually extremely simple algorithms with small constants are going to beat more complicated ones with larger constants unless you make the problem sizes large enough, so to make those more complex algorithms useful you often have to "artificially" increase the problem size.

> the one with better throughput might do better on tail latency as well, but under the same load, which is more important than latency under higher load, that it can handle.

If algorithm A performs worse at "the same load" than algorithm B, what sane benchmark is going to say algorithm A has better latency? Either there's a point where that ceases to be the case, or we can just say B has better latency and throughput.


The whole reason to use lock/wait-free algorithms is to minimize latency. Else you can use cheap locks and be done with it!

And for latency, the worst case is what matters. With 10 occurrences per macro operation, roughly two-thirds of all operations will see 90th-percentile latency at least once!
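That back-of-the-envelope claim is easy to check, assuming the 10 sub-operations hit the tail independently:

```python
p_tail = 0.10  # fraction of operations at or above the p90 latency
n = 10         # sub-operations per macro operation

# Probability a macro operation sees at least one p90-or-worse sub-operation:
hit_tail = 1 - (1 - p_tail) ** n
print(round(hit_tail, 3))  # ≈ 0.651
```

So about 65% of macro operations experience the 90th-percentile latency at least once, which is why the tail, not the median, dominates user-visible behavior.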



