I had the knowledge of how to do something, and then along came Brendan to tell us why it is so. That's really cool, I like it.
Now that I recall it, you are absolutely right about IO waits and buses, except for one thing: I doubt that on modern computers jumps cause no delay at all. About ten years ago (at university) my professor told us that every branch starts by dropping the prefetched instructions from the CPU's instruction cache and loading the ones at the new IP (regardless of architecture, this is common to all of them). With pipelines and superscalar CPUs it's not better but worse; it takes more and more time as cache capacity and complexity grow. I doubt it's any different nowadays, but correct me if I missed some new, radically different technology.
Edit: I've found this in Intel's manual, vol. 1, section 2.2.2.1:
"Two of these problems contribute to major sources of delays:
• time to decode instructions fetched from the target
• wasted decode bandwidth due to branches or branch target in the middle of cache lines"
The CPU follows not just one branch but several, called traces. So basically it's prefetching the prefetch. This is good for near jumps, but it doesn't solve the problem of jumping far, to somewhere that isn't cached at all. A jump to the next instruction is a near jump indeed, so it's absolutely pointless and just a bad habit.