In practice any out of order processor worth its salt ought to be able to entire...

matthavener · on May 31, 2012

Agreed. I think the real lesson is that reading is 10x cheaper than branching, so if you can do something with 10 non-branching ops, it'll be just as fast as a single branch.

seabee · on May 31, 2012

Conversely: if you can do something with a branch that's correctly predicted 90% of the time it'll be just as fast as 10 non-branching ops.

Branch prediction is a tool like any other - don't neglect it when it can help you.
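To make the trade-off concrete, here is a minimal sketch of the same computation written both ways. The function names and the "sum everything at or above a threshold" task are made up for illustration, and the mask trick assumes two's complement integers. Which version wins depends on how predictable the data is, and an optimizing compiler may well turn the branchy loop into conditional moves or SIMD on its own, so treat this as something to measure, not a rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branchy version: one conditional branch per element.
 * Cheap when the branch is well predicted (e.g. sorted input),
 * expensive when the outcome looks random to the predictor. */
int64_t sum_at_least_branchy(const int32_t *data, size_t n, int32_t threshold)
{
    assert(data != NULL || n == 0);
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= threshold)
            sum += data[i];
    }
    return sum;
}

/* Branchless version: a few extra ALU ops replace the branch.
 * The comparison yields 0 or 1; negating it gives a mask of
 * all zeros or all ones (assumes two's complement). */
int64_t sum_at_least_branchless(const int32_t *data, size_t n, int32_t threshold)
{
    assert(data != NULL || n == 0);
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t keep = -(int64_t)(data[i] >= threshold);
        sum += data[i] & keep; /* adds data[i] or 0 */
    }
    return sum;
}
```

Timing both over the same array, once sorted and once shuffled, is the usual way to see where the break-even point falls on a given CPU.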

marshray · on May 31, 2012

As a coder, how can I predict the branch predictor?

CamperBob2 · on May 31, 2012

There are definite rules, documented by the CPU vendor. A forward branch is assumed not taken, while a backward branch is assumed to come at the end of a loop that will probably iterate more than once. See http://software.intel.com/en-us/articles/branch-and-loop-reo... for example.

I'd assume the particulars will vary between CPU manufacturers and families, but the idea that backward branches will probably be taken seems fairly universal.

haberman · on June 1, 2012

This is outdated information; Intel chips have not used static prediction for conditional branches since NetBurst (Pentium 4).

"Pentium M, Intel Core Solo and Intel Core Duo processors do not statically predict conditional branches according to the jump direction. All conditional branches are dynamically predicted, even at first appearance."

--http://www.intel.com/content/dam/doc/manual/64-ia-32-archite...

gcp · on May 31, 2012

Now extend that to making memory reads predictable and prefetchable, and RAM is suddenly also free.

luckydude · on May 31, 2012

The lmbench loop for measuring memory latency is

while (n--) p = *p;

Out-of-order execution does not help; it can't, because each load's address is the result of the previous load. That was the key insight in that benchmark.
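For anyone who wants to try the idea, here is a rough sketch of a pointer-chasing loop in that style. It is not the lmbench source; `build_chain`, `chase`, and the stride-based layout are assumptions made up for this example. Every slot holds the address of another slot, so no load can start until the one before it has finished.

```c
#include <assert.h>
#include <stddef.h>

/* Make slot i point to slot (i + stride) % count.
 * If stride and count are coprime, the links form one cycle
 * that visits every slot. */
void build_chain(void **slots, size_t count, size_t stride)
{
    assert(slots != NULL && count > 0);
    for (size_t i = 0; i < count; i++)
        slots[i] = &slots[(i + stride) % count];
}

/* Follow the chain for n steps. The address of each load is the
 * value returned by the previous load, so the loads serialize.
 * Returning the final pointer keeps the compiler from discarding
 * the loop as dead code. */
void **chase(void **p, size_t n)
{
    while (n--)
        p = (void **)*p;
    return p;
}
```

To estimate latency, one would build a chain much larger than the cache level of interest, time a call to `chase` with a large `n`, and divide the elapsed time by `n`. A fixed stride is something a hardware prefetcher can learn, so a randomized chain is probably the safer choice for measuring main memory.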

jedbrown · on May 31, 2012

Depends on data dependencies and the number of registers, but yes, L1 latency is rarely the bottleneck (more likely to be load/store throughput, in-register shuffles, or arithmetic dependencies for such tight kernels).