Agreed. I think the real lesson is that reading is roughly 10x cheaper than branching, so if you can do something with 10 non-branching ops, it'll be about as fast as a single branch.
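To make that trade-off concrete, here is a minimal sketch of the same conditional sum written with a branch and without one. The function names and the threshold parameter are made up for illustration, and an optimizing compiler may well turn the branchy version into a conditional move or vectorize it on its own, so treat this as a picture of the idea rather than a benchmark.

```c
#include <stddef.h>
#include <stdint.h>

/* Branching version: one data-dependent conditional jump per element. */
int64_t sum_if_ge_branchy(const int32_t *data, size_t n, int32_t threshold) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= threshold)
            sum += data[i];
    }
    return sum;
}

/* Branchless version: a few extra ALU ops replace the conditional jump. */
int64_t sum_if_ge_branchless(const int32_t *data, size_t n, int32_t threshold) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* The comparison yields 1 or 0; negating gives all-ones or all-zeros. */
        int32_t mask = -(int32_t)(data[i] >= threshold);
        sum += data[i] & mask; /* adds data[i] when selected, 0 otherwise */
    }
    return sum;
}
```

On unpredictable input the second form trades a possible misprediction for a compare, a negate, and an AND per element, which is the kind of exchange the rule of thumb is about.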
There are definite rules, documented by the CPU vendor. A forward branch is assumed not taken, while a backward branch is assumed to come at the end of a loop that will probably iterate more than once. See http://software.intel.com/en-us/articles/branch-and-loop-reo... for example.
I'd assume the particulars will vary between CPU manufacturers and families, but the idea that backward branches will probably be taken seems fairly universal.
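As a rough sketch, the "backward taken, forward not taken" heuristic described above boils down to a single address comparison. The function name and the use of plain 64-bit addresses are assumptions for illustration only; this models the rule of thumb, not any particular CPU's predictor, and real hardware may rely on dynamic prediction instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: static guess based only on jump direction.
 * A target at or below the branch's own address is a backward jump
 * (most likely a loop back-edge), so it is guessed as taken. */
bool static_predict_taken(uint64_t branch_addr, uint64_t target_addr) {
    return target_addr <= branch_addr;
}
```

For example, a loop's closing branch at 0x1040 that jumps back to 0x1000 would be guessed taken, while an error-handling branch at 0x1000 that jumps ahead to 0x1040 would be guessed not taken.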
This is outdated information; Intel chips have not used static prediction for conditional branches since NetBurst (Pentium 4).
"Pentium M, Intel Core Solo and Intel Core Duo processors do not statically predict conditional branches according to the jump direction. All conditional branches are dynamically predicted, even at first appearance."
Depends on data dependencies and the number of registers, but yes, L1 latency is rarely the bottleneck (more likely to be load/store throughput, in-register shuffles, or arithmetic dependencies for such tight kernels).