CPU code optimization is always fun. The following statements are all true:
- Doing more isn't slower than doing less
- Simpler code isn't faster than complex code
- Executing only one logical branch in an if-else isn't faster than executing both branches and picking the right result later.
- Branching isn't slower than not branching at all.
Premature optimization also includes the "obvious" things, because if you don't benchmark it, you'll be wrong.
It always depends on the exact case.
And if this doesn't make sense to you, you simply don't know enough about CPUs
@karolherbst I think the hardest part is just how difficult it is to actually measure things correctly. Throwing a loop and a timer around some code != a useful measurement much of the time. :'(
@dotstdy yeah.. I skip the micro-benchmarking stuff and just check why apps are slow, or whether they have pointless CPU overhead I can get rid of somehow.
@karolherbst yeah mostly obvious issues are more about pointless work amplification than they are about microarchitecture, which is either good news or bad news depending on how you feel I guess xd
@dotstdy well the thing is that most of the time those overheads get reduced by adding more code. It's really rare that you end up deleting a bunch of code to make things faster.
Though microarch stuff matters for your super hot paths and then it becomes a bit of a pain to deal with random microarch nonsense.
- for all of the above, which one is faster may depend on the data
@kdave @karolherbst it is a bit difficult for me to talk to complete strangers about this, but drawing conclusions from llvm-mca about how branchy code will compare to branchless does not sound like a good idea
(and while we're at it, if you like llvm-mca, let me point you to uica.uops.info. It has a great deal of research behind it (see the link at the top, including a dissertation) and is more accurate.)
@kdave @karolherbst the mental model you use for ballpark estimation of this kind of thing doesn't actually need to be particularly complex to be useful. Mostly you're worried about branches, memory operations, and instruction use. If you can structure things to not be a branchy mess and to access memory efficiently, you're already well ahead of the pack. Alas, the compiler's ability to restructure your code in this way is extremely limited, so it's also about knowing what makes the compiler happy.
@karolherbst optimizing control flow to target the branch prediction hardware is the common thread of some of this. If branches are predictable they can be nearly free. If you can't make them predictable then consider some kind of branchless trick...
@acsawdey @karolherbst My favourite counterexample of this came from someone on the iOS team. When they got the first samples of a new chip, they found that one of their hot code paths got a lot slower. After a lot of digging, they discovered that the old chip had been mispredicting a branch. This caused a pipeline flush, which hurt a little bit but, as a side effect, pulled some data into cache. The newer CPU had a better branch predictor and so didn't do the mispredict. They then hit the instruction that needed the data and had a much larger stall while they waited for main memory.
Once they'd discovered this, they were able to add a prefetch hint at the place where they used to see the branch mispredict and fix the issue.
@david_chisnall @acsawdey memory latency masking on the CPU is really an art of its own. On the GPU it's easy, because you have 10000 other threads the hw can switch to, but in your single-threaded hot path on the CPU? good luck finding good heuristics that aren't terrible across a bunch of different CPUs.