
In practice, any out-of-order processor worth its salt ought to be able to hide L1 cache latencies entirely.


Agreed, I think the real lesson is: reading is 10x cheaper than branching, so if you can do something with 10 non-branching ops it'll be just as fast as a single branch.


Conversely: if you can do something with a branch that's correctly predicted 90% of the time it'll be just as fast as 10 non-branching ops.

Branch prediction is a tool like any other - don't neglect it when it can help you.
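To make the trade-off above concrete, here is a minimal sketch in C of the same computation written with and without a data-dependent branch. The function names and the counting task are made up for illustration. An optimizing compiler may well turn both into identical machine code, so treat this as a picture of the idea rather than a guaranteed performance difference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branching version: one conditional branch per element.
 * Its cost depends on how predictable the data is. */
size_t count_above_branchy(const int32_t *v, size_t n, int32_t threshold) {
    assert(v != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > threshold) {
            count++;
        }
    }
    return count;
}

/* Branch-free version: the comparison yields 0 or 1, which is
 * added unconditionally, so there is no data-dependent branch
 * to mispredict (the loop branch itself remains). */
size_t count_above_branchless(const int32_t *v, size_t n, int32_t threshold) {
    assert(v != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count += (size_t)(v[i] > threshold);
    }
    return count;
}
```

On sorted or otherwise predictable input the branching version should lose little or nothing; on random input the branch-free one is the likelier winner. Measure on the target CPU before committing to either.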


As a coder, how can I predict the branch predictor?


There are definite rules, documented by the CPU vendor. A forward branch is assumed not taken, while a backward branch is assumed to come at the end of a loop that will probably iterate more than once. See http://software.intel.com/en-us/articles/branch-and-loop-reo... for example.

I'd assume the particulars will vary between CPU manufacturers and families, but the idea that backward branches will probably be taken seems fairly universal.


This is outdated information; Intel chips have not used static prediction for conditional branches since NetBurst (Pentium 4).

"Pentium M, Intel Core Solo and Intel Core Duo processors do not statically predict conditional branches according to the jump direction. All conditional branches are dynamically predicted, even at first appearance."

--http://www.intel.com/content/dam/doc/manual/64-ia-32-archite...


Now extend that to making memory reads predictable and prefetchable, and RAM is suddenly also free.
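One way to make reads "prefetchable" when the addresses are known ahead of time is an explicit software prefetch. The sketch below assumes GCC or Clang, whose `__builtin_prefetch` builtin it uses; the function name and the `PREFETCH_DISTANCE` value are hypothetical and would need tuning per machine. For plain sequential access the hardware prefetcher usually does this on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning knob: how many iterations ahead to prefetch. */
#define PREFETCH_DISTANCE 16

/* Sum data[idx[i]] for i in [0, n). The indices are known in advance,
 * so the address needed PREFETCH_DISTANCE iterations from now can be
 * requested early and its latency overlapped with useful work. */
int64_t gather_sum(const int32_t *data, const uint32_t *idx, size_t n) {
    assert(data != NULL && idx != NULL);
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        if (i + PREFETCH_DISTANCE < n) {
            /* Arguments: address, 0 = read access, 1 = low temporal locality. */
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 1);
        }
#endif
        sum += data[idx[i]];
    }
    return sum;
}
```

The prefetch is only a hint: it never changes the result, and on compilers without the builtin the loop falls back to a plain gather.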


The lmbench loop for measuring memory latency is

while (n--) p = *p;

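Out-of-order execution does not help here, and it can't: each load's address is the result of the previous load, so the loads form a serial dependency chain. That was the key insight in that benchmark.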
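A minimal sketch of that pointer-chasing idea in C, for anyone who wants to try it. This is not lmbench's actual code: `build_chain`, `chase`, and the fixed-stride layout are assumptions made for illustration, and the timing harness is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Link the slots so that slot i holds the address of slot
 * (i + stride) % n. The chain is a single cycle covering every
 * slot only when stride and n are coprime. */
void build_chain(void **slots, size_t n, size_t stride) {
    assert(slots != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        slots[i] = (void *)&slots[(i + stride) % n];
    }
}

/* The dependent-load loop: the address of each load is the value
 * returned by the previous one, so the loads cannot overlap.
 * Returning the final pointer keeps the compiler from deleting
 * the loop as dead code. */
void **chase(void **p, size_t steps) {
    while (steps--) {
        p = (void **)*p;
    }
    return p;
}
```

Timing `chase` over a large number of steps and dividing by the step count gives an estimate of load-to-use latency for whichever cache level the chain fits in. A fixed stride is easy for a hardware prefetcher to follow, so a randomly permuted chain is the safer choice when measuring beyond L1.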


Depends on data dependencies and the number of registers, but yes, L1 latency is rarely the bottleneck (more likely to be load/store throughput, in-register shuffles, or arithmetic dependencies for such tight kernels).



